ON  FAST  ALGORITHM  AND  VLSI  DESIGN 
OF  FINITE  COMPUTATIONAL  STRUCTURES 


By 


ROM-SHEN KAO


A DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 
OF  THE  UNIVERSITY  OF  FLORIDA  IN 
PARTIAL  FULFILLMENT  OF  THE  REQUIREMENTS 
FOR  THE  DEGREE  OF  DOCTOR  OF  PHILOSOPHY 


UNIVERSITY  OF  FLORIDA 


1991 


Dedicated  to 


my  parents 


ACKNOWLEDGMENTS 


In the years leading to this dissertation, I have had the great pleasure of working for and
with an advisor who provided an environment conducive to learning. I would like to take this
opportunity to thank my advisor, Dr. Fred J. Taylor.

I would  also  like  to  thank  Dr.  Y.  C.  Chow,  Dr.  H.  Lam,  Dr.  M.  Law,  and  Dr.  J.  Principe  for 
serving  on  my  supervisory  committee. 

Special thanks go to my fellow students, who have offered useful advice: Ahmad Ansari,
Glenn Dean, Jun Li, Jon Mellott, Jeremy Smith, Eric Strom, Glenn Zelniker, and
especially John Ruckstuhl, for his courtesy and professional work in setting up the SUN
systems.

I would also like to thank those I have come to know at the University of Florida. They
made my years at the university a unique and memorable experience.

This work would not have been possible without the encouragement and support of my wife,
Jeannie.

Finally,  I would  like  to  thank  my  parents  for  teaching  me  the  philosophy  of  life  and  the 
value  of  education,  and  most  of  all  for  their  love,  patience,  and  support. 




TABLE  OF  CONTENTS 


ACKNOWLEDGMENTS  iii 

LIST  OF  TABLES vii 

LIST  OF  FIGURES viii 

KEY  TO  SYMBOLS  AND  ABBREVIATIONS xiii 

ABSTRACT xv 

CHAPTERS 

1 INTRODUCTION 1 

1.1  Overview  1 

1.2  Problem  Statement  and  Objectives  2 

1.3  Dissertation  Organization  5 

2 FINITE  COMPUTATIONAL  STRUCTURES 8 

2.1  Introduction  8 

2.2  Algebraic  Foundations 8 

2.3  Modular  Arithmetic  and  Number  Theory  10 

2.3.1  Chinese  Remainder  Theorem  (CRT)  for  Integers 17 

2.3.2  Residue  Number  System  Arithmetic  (RNS) 17 

2.4  Residue  Polynomials 19 

2.4.1  Properties  of  Polynomial 19 

2.4.2  Convolutions  and  Polynomial  RNS  21 

2.5  Finite  Fields  Based  on  Polynomial  Rings  25 

2.5.1 Construction of the Finite Field GF(p^m) 25

2.5.2 Properties of the Finite Field GF(p^m) 29

2.5.3 Bases of GF(p^m) over GF(p) 32

3 COMPUTATIONS  IN  FINITE  FIELDS 35 

3.1  Introduction  35 

3.2  Index  Calculus 35 

3.3  The  Table  Lookup  Method  for  Small  Fields 36 



3.4  Discrete  Logarithm  and  Exponentiation  Problems  45 

3.5  Primal  Basis  Multiplier  49 

3.6  Dual  Basis  Multiplier 55 

3.7  Normal  Basis  Multiplier  60 

4 FINITE  STRUCTURE  TRANSFORMS  63 

4.1  Introduction  63 

4.2  Properties  of  Transforms  and  Some  Prime  Field  Transforms 65 

4.3  Complex  Fields  Transforms  and  Complex  Arithmetic  69 

4.3.1  Complex  Number  Theoretic  Transform  (CNNT)  70 

4.3.2  Complex  Residue  Number  System  (CRNS)  71 

4.3.3  Quadratic  Residue  Number  System  (QRNS)  71 

4.3.4 Extended Algebraic Integer and Polynomial Residue Number System (PRNS) 73

4.4  Transforms  and  Computations  in  Extension  Fields 80 

4.4.1  Index  Calculus  Complex  Residue  Number  System  (ICCRNS)  ....  80 

4.4.2  Finite  Extension  Field  Transforms  83 

5 FAST  DISCRETE  FOURIER  TRANSFORMS  OVER  FINITE  FIELDS 93 

5.1  Introduction  93 

5.2  Fast  Prime-Factor  Finite  Field  Transform 93 

5.2.1  The  Application  of  Normal  Basis  and  Conjugacy  Properties 99 

5.2.2  A Normal  Basis  Architecture  for  the  Finite  Field  Transform 100 

5.2.3  System  Performance  Analysis  and  Discussions 103 

5.3  The  Basis-Change  Algorithm  for  Fast  Finite  Field  Transforms  105 

5.3.1  The  Conjugate  Sets  in  a Fast  Finite  Field  Transform 106 

5.3.2  The  Basis-Change  Algorithm 107 

5.3.3  System  Development  109 

5.3.4 System Complexity Evaluation and Example 111

6 THE PIPELINE POLYNOMIAL RNS PROCESSOR WITH FERMAT NUMBER TRANSFORM 115

6.1  Introduction  115 

6.1.1  Reasons  to  Use  Fermat  Prime  Number  As  M 116 

6.1.2  Algebraic  Congruence  to  a Fermat  Prime 117 

6.1.3  The  Multiplier-Free  Fast  FFT-Type  Isomorphic  Mappings 119 

6.2  A Pipeline  Third-order  Polynomial  RNS  Processor  122 

6.3  System  Implementation  of  the  Third-order  Polynomial  RNS  Processor  . . 128 

6.3.1  A Nine-bit  PRNS  Processor  Development  on  the  HP-DCS 130 

6.3.2  A Five-bit  PRNS  Processor  Development  on  the  Magic  CAD  Tool  148 

7 CONCLUSION  AND  FUTURE  RESEARCH 167 

7.1  Conclusion 167 




7.2 Future Research 173


APPENDICES 

A ALGEBRAIC  FOUNDATIONS:  DEFINITIONS  AND  THEOREMS  176 

A.1 GROUPS 176

A.2 RINGS AND FIELDS 177

A.3 THEOREMS 180

B CIRCUIT  DESIGNS:  HP-DCS  AND  MAGIC  CAD  TOOLS 182 

B.1 The HP-DCS Schematics of the Nine-Bit Third-Order PRNS Processor 182

B.2 The HP-DVI Timing Diagrams of the Nine-Bit Third-Order PRNS Processor 188

B.3  The  Magic  Design  of  the  Five-Bit  Third-Order  PRNS  Processor 199 

B.4  CIF  and  Postscript  Files  of  a Two-Input  NAND  Gate 212 

B.4.1  CIF:  A Geometry  Language  212 

B.4.2 PostScript: A Geometry Language 215

BIBLIOGRAPHY  218 

BIOGRAPHICAL  SKETCH 223 




LIST  OF  TABLES 


Table 2.1 The Elements and Their Order of the Multiplicative Group of the Ring Z_15 14

Table 2.2 The Possible and Actual Order of the Elements in the Multiplicative Group of the Ring Z_15 15

Table 2.3 The Elements and Their Order of the Field Z_17 16


Table 2.4 Elements $\beta = a_3\alpha^3 + a_2\alpha^2 + a_1\alpha + a_0$ of $F_2(\alpha)$ Generated by $f_1(x)$ 28

Table 2.5 Elements $\beta = a_3\alpha^3 + a_2\alpha^2 + a_1\alpha + a_0$ of $F_2(\alpha)$ Generated by $f_2(x)$ 29

Table 3.1 Illustration of Indices for N = 9 36

Table 3.2 The Zech's Logarithms in GF(2^4) 39

Table 4.3 Field Elements of GF(2^4) Represented Under Primal, Primitive Normal, and Normal Bases 92

Table 5.1 The Complexity Comparison Between Conventional and Normal Basis Approaches 103

Table  6.1  The  Timing  Information  of  the  Simulation  of  the  PRNS  Processor 147 




LIST  OF  FIGURES 


Figure  2.1  Example  of  Abelian  Finite  Groups  9 

Figure  3.1  A Finite  Field  Arithmetic  Unit  Based  On  the  All-Table  Approach  ....  37 

Figure  3.2  A Discrete  Logarithm/Exponentiation  Multiplier 38 

Figure  3.3  A Logarithmic  Finite  Field  Arithmetic  Unit 41 

Figure 3.4 The Building Block of the Finite Field Multipliers. a) Storage element such as
register and accumulator. b) Multiple-input modulo p adder. c) Modulo p multiplier for
multiplying data by g_i. d) Modulo p adder 49

Figure  3.5  A Generic  Primal  Basis  aa  -Multiplier  50 

Figure  3.6  A Primal  Basis  Accumulated  aa^-Multiplier 52 

Figure  3.7  A Primal  Basis  Nested  ab  Multiplier  54 

Figure  3.8  A Generic  Dual  Basis  aa  -Multiplier 56 

Figure  3.9  A Dual  Basis  Accumulated  ao^-Multiplier 57 

Figure  3.10  A Dual  Basis  Summed  -Multiplier 59 

Figure  3.11  A Dual  Basis  Nested  ab  Multiplier 60 

Figure 3.12 A Cyclic Shift Register for Power Forming in GF(p^m) 61

Figure  3.13  A Normal  Basis  Multiplier 62 



Figure 4.1 The Representation of Complex Number t in $Z[\omega]$ 76

Figure 4.2 The Pictorial Representation of the Nth-order Single Modulus PRNS System

Figure  4.3  The  Cyclic  Shift  Property  of  Normal  Basis  Representation  of  V 91 

Figure  5.1  The  VLSI  Butterfly  Module  CB  of  a Cyclotomic  Coset  of  Length  / . . . 100 

Figure  5.2  The  s Stages  Fast  Galois  Field  Transform  System 101 

Figure  5.3  The  255-point  Good-Thomas  FFT  Pipeline  Stages 102 

Figure  5.4  The  Conjugate  Set  Evaluation  Circuit  108 

Figure  5.5  An  Example  of  the  Scaling  Shifter 110 

Figure  5.6  An  Example  of  the  Vector  Combiner 110 

Figure 5.7 The Primal Basis Generator 111

Figure  5.8  The  Normal  Basis  Generator 112 

Figure 6.1 The Isomorphic Forward Transform $F_4$ 124

Figure 6.2 The Isomorphic Inverse Transform 125

Figure 6.3 The Isomorphic Mapping Scalar Modules 127

Figure 6.4 The Isomorphic Inverse Transform $F_4^{-1}$ 128

Figure 6.5 System Block of the 9-bit Third-order PRNS Arithmetic Unit with FFT-Type Fast Isomorphic Mappings 129

Figure  6.6  Top  Level  Schematic  of  the  PRNS  System  131 

Figure  6.7  The  Hierarchical  Design  of  the  Third-Order  PRNS  Processor 132 



Figure  6.8  The  Boolean  Equation  of  the  Module  - negmodp  134 

Figure  6.9  The  Multiwindow  Demonstration  of  FTI  Device  Selection  135 

Figure  6.10  The  Modulo  M Adder  136 

Figure 6.11 The MDL.vj Unit Implemented Using a Table Lookup Method 137

Figure  6.12  The  Boolean  Equation  of  the  Module  - summodp 138 

Figure  6.13  The  Boolean  Equation  of  the  Module  - sumovck  139 

Figure  6.14  A Nine-bit  Modular  Adder 140 

Figure  6.15  The  Modulo  M Multiplier 141 

Figure 6.16 The Scalar Module RSQ 143

Figure  6.17  The  Isomorphic  Forward  Mapping  Module 144 

Figure  6.18  The  Isomorphic  Inverse  Mapping  Module  145 

Figure 6.19 The Logic and Timing Simulation Result of the Third-Order PRNS Processor 146

Figure  6.20  Floorplan  of  the  Five-bit  Modular  Multiplier  149 

Figure  6.21  A Hierarchical  Design  of  the  Five-Bit  Modular  Multiplier 150 

Figure  6.22  Basic  CMOS  Cells  151 

Figure  6.23  A Two-Input  CMOS  NAND  Gate  Magic  Layout 152 

Figure  6.24  An  AND-OR  Programmable  Logic  Array  153 

Figure  6.25  A Pseudo-nMOS  Programmable  Logic  Array  Design  154 

Figure  6.26  Typical  Input  Protection  Circuit 155 

Figure  6.27  The  PLA  Transistor-Level  Layout  of  the  Negator  Cell  - palneg 156 



Figure  6.28  A CMOS  Magic  Layout  of  the  Five-bit  Modular  Multiplier  158 

Figure  6.29  The  Embedded  Test  Circuitry  of  the  Five-bit  Modular  Multiplier  159 

Figure  6.30  The  Timing  Diagram  of  the  Operation  of  the  Modular  Multiplier  160 

Figure  6.31  Floorplan  of  the  Five-Bit  Polynomial  RNS  Processor  162 

Figure  6.32  A Hierarchical  Design  of  the  Five-Bit  Polynomial  RNS  Processor 163 

Figure  6.33  A Split-then-Add  Scaler  Module 164 

Figure  6.34  A Transistor  Layout  of  the  Scaler  Module  scalerRSQ  165 

Figure B.1.1 mux9 - The Nine-Bit Multiplexer 183

Figure B.1.2 lat4b - The Four-Tuple Latch 184

Figure B.1.3 mul8x8 - The Eight-Bit by Eight-Bit Multiplier 185

Figure B.1.4 scalerNRSQ - The (-r^2)-Scaler Module 186

Figure B.1.5 scalerNR - The (-r)-Scaler Module 187

Figure  B.2.1  The  Timing  Diagram  of  negmodp  - The  Negator  189 

Figure  B.2.2  The  Timing  Diagram  of  summodp  - The  Sum  Modulo  Converter  . . . 190 

Figure  B.2.3  The  Timing  Diagram  of  sumovck  - The  Sum  Overflow  Checker  ....  191 

Figure  B.2.4  The  Timing  Diagram  of  mux9  - The  nine-bit  multiplexer 192 

Figure  B.2.5  The  Timing  Diagram  of  add9modp  - The  Nine-bit  Modulo  Adder  . . . 193 

Figure B.2.6 The Timing Diagram of mul8x8 - The Eight-Bit by Eight-Bit Multiplier 194

Figure B.2.7 The Timing Diagram of mul9modp - The Nine-Bit Modulo Multiplier 195



Figure B.2.8 The Timing Diagram of forwmap - The Isomorphic Forward Mapping Module 196

Figure B.2.9 The Timing Diagram of backmap - The Isomorphic Inverse Mapping Module 197

Figure B.2.10 The Timing Diagram of scalerNRSQ - The (-r^2)-Scaler Module 198

Figure B.3.1 The Magic Design Circuit Diagram I 200

Figure  B.3.2  The  Magic  Design  Circuit  Diagram  II  201 

Figure  B.3.3  The  Magic  Design  Circuit  Diagram  III 202 

Figure  B.3.4  The  Magic  Design  Circuit  Diagram  IV  203 

Figure  B.3.5  The  Magic  Design  Circuit  Diagram  V 204 

Figure B.3.6 The Magic Transistor Layout of multip - The Three by Three Multiplier 205

Figure B.3.7 The Magic Transistor Layout of sumall - The Modular Adder 206

Figure B.3.8 The Magic Transistor Layout of smod - The Modular Adder for the Isomorphic Mappings 207

Figure B.3.9 The Magic Transistor Layout of scln - The FFT-Type Module for the Isomorphic Mappings 208

Figure B.3.10 The Magic Transistor Layout of fmap - The Isomorphic Forward Mappings 209

Figure B.3.11 The Magic Transistor Layout of padin - The Input Pad 210

Figure  B.3.12  The  Magic  Transistor  Layout  of  padout  - The  Output  Pad  211 

Figure B.4.1 The CIF Description of a Two-Input NAND Gate 214

Figure  B.4.2  The  PostScript  Description  of  a Two-Input  NAND  Gate 215 



KEY  TO  SYMBOLS  AND  ABBREVIATIONS 


SYMBOL                              EXPLANATION

DFT                                 discrete Fourier transform
FFT                                 fast Fourier transform
IFFT                                inverse fast Fourier transform
N                                   the set of natural numbers (= positive integers)
Z                                   the set of integers
Q                                   the set of rational numbers
R                                   the set of real numbers
C                                   the set of complex numbers
$S_1 \times \cdots \times S_n$      the set of all n-tuples $(s_1, \ldots, s_n)$ where $s_i \in S_i$ for $1 \le i \le n$
$S^n$                               the set of all n-tuples $(s_1, \ldots, s_n)$ where $s_i \in S$ for $1 \le i \le n$
$|S|$                               the order or cardinality (= number of elements) of the finite set S
$\lfloor t \rfloor$                 the greatest integer $\le t \in R$
$\gcd(k_1, \ldots, k_n)$            the greatest common divisor of $k_1, \ldots, k_n$
$\mathrm{lcm}(k_1, \ldots, k_n)$    the least common multiple of $k_1, \ldots, k_n$
$\binom{n}{k}$                      binomial coefficient
$\phi(n)$                           Euler's function of n
$A^T$                               the transpose of the matrix A
$\det(A)$                           the determinant of the matrix A
$\langle a \rangle$                 the cyclic group generated by a
$(a)$                               the principal ideal generated by a
$R/J$                               the residue class ring of the ring R modulo the ideal J
$Z_n$                               the group of integers modulo n
$Z/(n)$                             the ring of integers modulo n
$R[x]$                              the polynomial ring over the ring R
$\deg(f)$                           the degree of the polynomial f
$Q_n(x)$                            the nth cyclotomic polynomial
$K(M)$                              the extension of K obtained by adjoining M
$[L:K]$                             the degree of the field L over K
$F_p$, GF(p)                        the finite field of order p
$F_p^*$                             the multiplicative group of nonzero elements of $F_p$
$\mathrm{Tr}(\alpha)$               the trace of the element $\alpha$ of the field F
$N(\alpha)$                         the norm of the element $\alpha$ of the field F
$\log_b(a)$                         the discrete logarithm of a with respect to the base b
$\exp_b(r)$                         the discrete exponential function of r with respect to the base b
♦                                   end of proof, end of example, end of remark



Abstract  of  Dissertation  Presented  to  the  Graduate  School 
of  the  University  of  Florida  in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of  Doctor  of  Philosophy 


ON  FAST  ALGORITHM  AND  VLSI  DESIGN 
OF  FINITE  COMPUTATIONAL  STRUCTURES 


By 

Rom-Shen  Kao 
May  1991 


Chairman:  Dr.  Fred  J.  Taylor 

Major  Department:  Electrical  Engineering 

This dissertation examines some properties of finite computational structures, such
as finite groups, rings, and fields, and extends their applications to develop fast algorithms
and to build high-speed, low-complexity very large scale integrated circuit (VLSI) systems for
digital signal processing.

To achieve high-speed signal processing, finite computational structures are receiving
growing attention because of their ability to support high-speed integer arithmetic. In this
dissertation, various fundamental algebraic theories are applied to the development of a
finite computational system. A finite polynomial ring structure is addressed to expedite and
simplify the multiplication of polynomials. The applications of this polynomial structure are
in the areas of short-block-length cyclic convolution and complex number arithmetic, which
originates from the concept of algebraic integer approximation. A new algorithm for finite
field transforms with the cyclic convolution property (CCP) is investigated. This algorithm
combines abstract algebra concepts, which include normal basis representation and the
conjugacy property, with factor-type fast Fourier transform (FFT) algorithms to expedite finite
field transforms. Finally, detailed designs of a transform butterfly module and a third-order
polynomial RNS (PRNS) processor are presented. It is shown that a compact, economical, and
high-performance design is feasible when based on a VLSI architecture.




CHAPTER  1 
INTRODUCTION 

1.1  Overview 

The theory of finite computational structures (finite groups, rings, and fields) is a
branch of modern algebra which has climbed to the forefront of research activity during
the past 50 years. Reflecting the work of such eminent mathematicians as Euclid, Pierre
de Fermat, Leonhard Euler, Joseph-Louis Lagrange, Carl Friedrich Gauss, and Évariste
Galois, this field of study has diverse applications in combinatorics, coding theory, and
the mathematical study of switching circuits. A flurry of recent activity in the
development of interesting, computationally efficient algorithms, coupled with the advent of
high-speed, high-density digital computing devices, makes the field well suited to applications
in data communications, error control codes, speech processing, radar/sonar signal
processing, and image processing.

Arithmetic operations in finite computational structures play an important role in
convolving and transforming digital signals, as well as in encoding and decoding error control
codes, where sampled data indices or amplitudes are treated in a number-theoretic manner.
Modular arithmetic involves operations such as multiplication, power-raising, multiplicative
inversion, convolution, and transforms. In applications, these operations are
among the most complex and time-consuming. Hence, a fast algorithm
which minimizes either the cost or the computation time of the circuits required to
perform these operations yields a considerable improvement in those applications.

While some computational algorithms are more popular than others, users generally
attribute a higher value to the most efficient version of these algorithms. One concept central
to many of these algorithms is the Chinese remainder theorem (CRT) in the residue number
system (RNS), a method which dates back to antiquity and involves the partitioning of a large
problem into a number of smaller subproblems. By strengthening and reestablishing the
threads spun from classical mathematics, the CRT opened numerous applications for applied
mathematicians and arithmetic complexity theorists, and research took a necessary step
toward digital signal processing (DSP). Another DSP-related algorithm is the number theoretic
transform (NTT), which is defined over finite fields and rings of integers with all arithmetic
performed modulo an integer. These transforms, which have the cyclic convolution property
(CCP), are advantageous for applications based on digital convolution, such as digital
filtering, correlation studies, and the multiplication of very large integers.
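The partitioning idea behind the CRT can be illustrated with a minimal Python sketch. The moduli 7, 11, and 13 and the helper names to_rns and from_rns are assumptions chosen for illustration; they are not taken from the dissertation. Each residue channel is processed independently, and the CRT recombines the channels at the end.

```python
from math import prod

def to_rns(x, moduli):
    """Return the residues of x with respect to each modulus."""
    return [x % m for m in moduli]

def from_rns(residues, moduli):
    """Rebuild x modulo prod(moduli) with the CRT (moduli pairwise coprime)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m                   # product of the other moduli
        total += r * m_i * pow(m_i, -1, m)  # pow(..., -1, m) is the inverse mod m
    return total % big_m

# Illustrative moduli (an assumption); the dynamic range is 7 * 11 * 13 = 1001.
moduli = [7, 11, 13]
a, b = 123, 7
ra, rb = to_rns(a, moduli), to_rns(b, moduli)

# Each channel works on small residues only, with no carries between channels.
sum_rns = [(x + y) % m for x, y, m in zip(ra, rb, moduli)]
prod_rns = [(x * y) % m for x, y, m in zip(ra, rb, moduli)]
```

The result is exact only while the true sum or product stays below the product of the moduli, which is why the moduli set is normally chosen to cover the required dynamic range.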

The primary components of a DSP system include a multiplier or multiplier/accumulator,
program and data memory, and a controller with some form of register file. Since high
levels of integration are the current trend in system design, the existing general-purpose
processors may not meet new demands. In recent years, very large scale integrated
circuit (VLSI) technology has been widely applied in signal processing. In the custom
design arena, higher-density and more sophisticated gate-array designs realize the dream of
placing entire systems on a single chip, which can contain on the order of 100,000 logic
gates. The theory of algorithms provides a means of efficiently organizing these gates. Thus, the
integration of fast arithmetic and mathematical algorithms for signal processing is a key
factor in breaking through the restrictions on the development of electronic device
technology. This dissertation introduces novel concepts in number system application and discusses
the development of their associated systems.

1.2  Problem  Statement  and  Objectives 


Applications in digital filtering, correlation studies, radar matched filtering, and the
multiplication of very large integers are based on digital convolution, which can be
implemented most efficiently by the NTT under some constraints. The arithmetic required to
accomplish the NTT is exact and involves additions, subtractions, and bit shifts. As in the
case of the DFT, fast algorithms exist for the NTT. The family of NTTs includes the Fermat,
Mersenne, Rader, pseudo-Fermat, pseudo-Mersenne, complex Mersenne, and complex Fermat
transforms [McC79, Ell82]. They are truly digital transforms and their implementation involves
no round-off error. However, implementing the multiplications (scaling) and
additions/subtractions that the NTT requires is such a time-consuming (or hardware-intensive)
task that any advantage inherited from the cyclic convolution property is eventually offset.
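The cyclic convolution property can be sketched in Python as follows. The parameters are assumptions chosen for illustration and are not taken from the dissertation: the Fermat prime 17 as modulus, transform length 8, and the root 2, which has multiplicative order 8 modulo 17. The transform is written in its direct O(N^2) form for clarity rather than as a fast algorithm.

```python
P = 17   # Fermat prime 2^4 + 1 (illustrative choice)
N = 8    # transform length
W = 2    # 2 has multiplicative order 8 modulo 17

def ntt(x, root):
    """Direct number theoretic transform of a length-N sequence modulo P."""
    return [sum(x[n] * pow(root, n * k, P) for n in range(N)) % P
            for k in range(N)]

def intt(X):
    """Inverse transform: use the inverse root and scale by N^-1 modulo P."""
    n_inv = pow(N, -1, P)
    w_inv = pow(W, -1, P)
    return [(n_inv * v) % P for v in ntt(X, w_inv)]

def cyclic_convolve(x, h):
    """Cyclic convolution through pointwise products in the transform domain."""
    X, H = ntt(x, W), ntt(h, W)
    return intt([(a * b) % P for a, b in zip(X, H)])

def direct_cyclic_convolve(x, h):
    """Reference cyclic convolution computed from the definition."""
    return [sum(x[m] * h[(n - m) % N] for m in range(N)) % P
            for n in range(N)]

x = [1, 2, 3, 0, 0, 0, 0, 0]
h = [1, 1, 0, 0, 0, 0, 0, 0]
y = cyclic_convolve(x, h)
```

Because the root is a power of two, every twiddle factor in this sketch is itself a power of two, so a hardware version could replace the general multiplications in the transform by bit shifts; the pointwise products still need a modular multiplier.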

Recent attention has been focused on situations requiring data manipulation over
complex fields. The problem of approximating complex numbers by algebraic
integers has also been investigated by many researchers [Gam85, Coz85]. The operations in the
ring of algebraic integers are translated to operations on polynomials by the polynomial
residue number system (PRNS). The modular number system admits an unusual
representation of complex data, which leads to a complete decoupling of the real and imaginary
channels, thereby simplifying complex multiplication and providing error isolation between the
channels. Unfortunately, isomorphic mappings between the complex number and PRNS
domains suffer from a nontrivial transform problem which eventually precludes the inherent
advantages of the PRNS approach.
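The decoupling of channels can be illustrated with the second-order (quadratic RNS) special case. The sketch below is an assumption-laden illustration, not the dissertation's third-order processor: it uses the prime 13 and the value j = 5, for which j^2 = 25 is congruent to -1 modulo 13, and the function names are hypothetical.

```python
P = 13   # illustrative prime with P % 4 == 1, so -1 has a square root mod P
J = 5    # 5 * 5 = 25 = -1 (mod 13)

def to_qrns(a, b):
    """Map the complex integer a + ib to the pair (a + J*b, a - J*b) mod P."""
    return ((a + J * b) % P, (a - J * b) % P)

def from_qrns(z, z_star):
    """Inverse mapping back to (real, imaginary) residues mod P."""
    inv2 = pow(2, -1, P)
    inv2j = pow(2 * J, -1, P)
    return ((inv2 * (z + z_star)) % P, (inv2j * (z - z_star)) % P)

def qrns_mul(u, v):
    """Complex multiplication becomes two independent modular products."""
    return ((u[0] * v[0]) % P, (u[1] * v[1]) % P)

# (2 + 3i)(1 + 2i) = -4 + 7i, which is (9, 7) modulo 13.
result = from_qrns(*qrns_mul(to_qrns(2, 3), to_qrns(1, 2)))
```

In this form a complex product needs two modular multiplications with no cross-product terms, at the price of the forward and inverse mappings, which is the transform overhead the paragraph above refers to.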

The maturing of silicon technology and sophisticated CAD/CAE tools have
allowed for the economical development of semicustom and custom application-specific
integrated circuit (ASIC) chips. The concurrency, modularity, and regular interconnection of
finite computational structures, such as the RNS, are the main properties of an
efficient VLSI system. Exploiting these merits, Taylor and Kao [Kao87] developed a
powerful complex ALU based on a quadratic RNS (QRNS) and proved the concept
using 4187 gates of a GE gate-array design. Based on these observations, the coupling of
the finite computational structures with the recent advances in VLSI technology suggests an
efficient implementation of many signal processing algorithms.




While DSP researchers have responded to requirements for high speed and low cost,
parallelism and low-complexity modules remain of primary concern. The trend in
VLSI system design is toward high levels of integration based on efficient algorithms. The
close relationship between signal processing algorithms and mathematical theory
looks so promising that a fundamental research effort can no longer be ignored. However, a
successful investigation of this relationship cannot be accomplished without addressing the
following problems:

1) The construction of fast algorithms for real-time signal processing applications
from basic mathematical theories such as number theory and abstract algebra;

2) The development of an efficient transform method of mapping complex numbers
to a new representation which provides parallelism and modular arithmetic capability:

Conventional complex signal processing requires the formulation of real and
imaginary channels and the special handling of cross-product terms for complex
multiplication. The efficient implementation of complex digital arithmetic has
become increasingly important because many signal processing applications require
processing of complex signals with complex digital filters;

3) The development of techniques to reduce the computational complexity of finite
field transforms:

With the advantages of truly digital and roundoff-free transforms, the
implementation of a cyclic convolution system becomes a promising task. However, the
finite field transform is still an awkward mapping without special treatment; and

4) The development of a highly integrated arithmetic system which is equipped with
fast, compact decomposed channels:

Speed and density are always conflicting factors in integrated circuit (IC)
technology. It seems advantageous to construct arithmetic systems which provide
the highest performance in terms of cost, speed, and packaging by implementing
high-speed lower density devices.

The  resolution  of  these  problems  involves  the  primary  objectives  of  this  research, 
which  are  as  follows: 

1) Achieve effective and fast algorithms for modular arithmetic and finite field
operations to accomplish the development of the polynomial RNS and the finite field transform;

2) Apply the theories and algorithms to the development of a high-throughput,
low-complexity finite field transform system for digital convolution applications, and a
pipelined polynomial RNS machine with simple isomorphic mappings for complex arithmetic
applications; and

3)  Develop  a systematic  methodology  for  the  efficient  VLSI  implementation  of  the 
developed  architectures. 


1.3  Dissertation  Organization

This dissertation presents both fast algorithms and VLSI designs as solutions to
the problems of arithmetic computation, circular convolution, and the discrete Fourier
transform in finite computational structures, with an emphasis on finite fields, also called
Galois fields. The work is principally based on modern algebra; while some of the
well-known theorems are stated without proof, most of the definitions are given in the
appendix for reference.

The mathematical framework from which this research effort stems is presented in
Chapter 2. This chapter focuses on the mathematical preliminaries, which include modular
arithmetic, the Chinese remainder theorem, residue polynomial theory, and the finite
computational structures of groups, rings, and fields. The material covered in Chapter 2 is based
on textbook discussions provided by McClellan and Rader [McC79], Lidl and
Niederreiter [Lid86], and MacWilliams and Sloane [Mac77]. In order to reinforce the
number-theoretic principles, concrete examples are inserted at appropriate points in the text.




Computations in finite fields are developed in Chapter 3, which begins by presenting
table-lookup methods for arithmetic in small finite fields, involving the discrete logarithm
and exponentiation problems; the nontrivial addition operation in the logarithmic form of finite
fields is solved by applying Zech's logarithm method. Since the memory lookup method
is unrealistic for large fields, the discussion turns to an investigation of sequential and
parallel design approaches. These designs are based on various basis representations of finite
field elements, which include the primal, dual, and normal basis representations. The normal
basis structure is suggested as the most efficient model for calculating the fast finite field
transform. The chapter concludes with an investigation of the cyclic convolution property within
a finite field.
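The role of Zech's logarithm in logarithmic arithmetic can be sketched in Python. The sketch assumes GF(2^4) generated by the primitive polynomial x^4 + x + 1; that polynomial and all function names are illustrative assumptions, and the dissertation's own tables may use a different generator. With elements stored as exponents of a primitive element alpha, multiplication is an addition of exponents modulo 15, and addition uses the identity alpha^i + alpha^j = alpha^(i + Z(j - i)), where 1 + alpha^n = alpha^Z(n).

```python
PRIM_POLY = 0b10011   # x^4 + x + 1 (assumed primitive polynomial)
ORDER = 15            # size of the multiplicative group of GF(2^4)

# exp_table[i] holds alpha^i as a 4-bit integer; log_table inverts it.
exp_table = []
value = 1
for _ in range(ORDER):
    exp_table.append(value)
    value <<= 1                 # multiply by alpha
    if value & 0b10000:
        value ^= PRIM_POLY      # reduce modulo the field polynomial
log_table = {v: i for i, v in enumerate(exp_table)}

# Zech's logarithm Z(n), defined by 1 + alpha^n = alpha^Z(n).
# Z(0) is undefined in characteristic 2 because 1 + 1 = 0.
zech = {n: log_table[1 ^ exp_table[n]] for n in range(1, ORDER)}

def mul_logs(i, j):
    """Exponent of alpha^i * alpha^j."""
    return (i + j) % ORDER

def add_logs(i, j):
    """Exponent of alpha^i + alpha^j, or None when the sum is zero."""
    if i == j:
        return None
    return (i + zech[(j - i) % ORDER]) % ORDER
```

Only the single table zech is needed for addition, in place of a full two-operand addition table, which is the memory saving that makes the logarithmic form attractive for small fields.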

Chapter 4 demonstrates various transforms within finite computational structures
using number theoretic concepts for error-free and fast computation. A brief description of the
Fermat number transform (FNT) and the Mersenne number transform (MNT) over integer rings is
presented. These transforms are also applicable over extension fields, which include the
second-order extension field, GF(p^2), and the higher-order extension field, GF(p^m). A
significant feature of the RNS is the ability to perform complex arithmetic efficiently. Several
variations of the RNS for complex integers, most notably the quadratic RNS and its
extended version, the polynomial RNS, are investigated. Finally, finite field transforms and their
algebraic properties are developed. This chapter lays the groundwork for the system
developments found in later chapters.

Drawing upon the results on modular arithmetic, complex arithmetic based on
extension fields, and the finite field transforms presented in previous chapters, systems for the
polynomial RNS and the fast finite field transform are developed in the following chapters.
Parallel to the conventional discrete Fourier transform (DFT) in complex number fields,
Chapter 5 introduces a fast algorithm for computing DFTs within a finite field. This is
accomplished by applying the cyclotomic coset properties
of finite fields to the intermediate stages of the transform iterations. Further computational
savings in the transform are obtained by applying a cyclic shift to the evaluated element in a
cyclotomic coset, where elements are represented in a normal basis. During the
sum-of-products operations in the evaluation of a coset leader in the intermediate stages, the field
elements cannot be kept in the normal basis representation. Thus, a novel basis-change
algorithm is applied to expedite the evaluation of the sum of products and maintain the
intermediate result in the normal basis representation.
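Two of the ingredients named above, cyclotomic cosets and the cyclic shift property of a normal basis, can be checked with a short Python sketch. It assumes GF(2^4) with field polynomial x^4 + x + 1 and the normal basis generated by alpha^3; both choices, and all function names, are assumptions made for illustration rather than the dissertation's design.

```python
PRIM_POLY = 0b10011   # x^4 + x + 1 (assumed field polynomial)
M = 4                 # extension degree of GF(2^4) over GF(2)

def gf_mul(a, b):
    """Multiply two GF(2^4) elements held as 4-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= PRIM_POLY
    return result

def cyclotomic_cosets(n=15, q=2):
    """Partition {0, ..., n-1} into orbits of multiplication by q modulo n."""
    seen, cosets = set(), []
    for s in range(n):
        if s in seen:
            continue
        coset, k = [], s
        while k not in coset:
            coset.append(k)
            k = (k * q) % n
        seen.update(coset)
        cosets.append(coset)
    return cosets

# Normal basis {b, b^2, b^4, b^8} with b = alpha^3 = 0b1000 (assumed choice).
basis = [0b1000]
for _ in range(M - 1):
    basis.append(gf_mul(basis[-1], basis[-1]))

def from_normal(coords):
    """Combine GF(2) coordinates with the normal basis elements."""
    value = 0
    for c, b in zip(coords, basis):
        if c:
            value ^= b
    return value

def to_normal(value):
    """Find the normal basis coordinates by searching all 16 candidates."""
    for bits in range(1 << M):
        coords = [(bits >> i) & 1 for i in range(M)]
        if from_normal(coords) == value:
            return coords
    raise ValueError("the chosen elements do not form a basis")
```

In this representation squaring an element rotates its coordinate vector by one position, so the conjugates within a cyclotomic coset can be produced from a single evaluated element by wiring alone.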

An in-depth discussion of previous theories regarding the implementation of polynomial RNS arithmetic in VLSI systems, along with their subsequent performance analyses, is provided in Chapter 6. Design models considered for the pipelined third-order polynomial RNS processor based on the Fermat number transform include the multiplier-free isomorphic mappings with fast Fourier transform expedition and the high-speed modular multiplier. Verification of this work, in terms of logic simulation and timing analysis, is provided by the HP Design Capture System. Furthermore, a silicon layout of the 5-bit cubic polynomial RNS processor is implemented with the MAGIC computer-aided design (CAD) tool, which is commensurate with 2-micron double-metal complementary metal oxide semiconductor (CMOS) technology.

Chapter  7 presents  the  conclusions  readily  established  by  this  research  effort  and  offers 
a sense  of  direction  for  future  research  endeavors. 


CHAPTER  2 

FINITE  COMPUTATIONAL  STRUCTURES 
2.1  Introduction 

This chapter investigates the fundamental properties of finite computational structures pertaining to groups, rings, and fields. The discussion begins by exploring the concepts of congruence and residue reduction. The foundations of number theory stemming from Euler's theorem and the concept of primitive roots are also introduced. Discussion of the Chinese remainder theorem in relation to both integers and polynomials leads to the realization that polynomial algebra has a close relationship to digital convolution. The mathematical application of this relationship to polynomials over fields induces the construction of finite extension fields. The properties of finite fields, such as raising to the power of the characteristic, minimal polynomials, conjugates, and the basis representation of a field element in an extension field over its ground field, are also introduced.

2.2  Algebraic  Foundations 

The trend of VLSI system design utilizes high levels of integration based on efficient algorithms. Logically sound algorithms can be viewed as elegant algebraic identities. To construct these algorithms, one must be familiar with the powerful structures of number theory and of modern algebra. The structures containing the set of integers, polynomial rings, and finite fields play an important role in the design of signal processing algorithms. For instance, although it is most familiar within the complex field, a discrete Fourier transform can be defined in any field. In order to gain better insight into the mathematical theories which lead to this application, a brief review of finite mathematical structures is provided along with a discussion of integer rings, residue polynomials, and finite fields.

Groups. A group is a mathematical abstraction of an algebraic structure which appears frequently in various concrete forms. In general, let $S$ be a set and let $S \times S$ denote the set of all ordered pairs $(s, t)$ with $s, t \in S$. Then a mapping from $S \times S$ into $S$ is called a binary operation on $S$. The closure property of an operation requires that the image of $(s, t)$ be in $S$. A group $G$ is a set which possesses a binary operation satisfying the properties of associativity, identity, and inverses (APPENDIX A D1.1). Furthermore, groups with the commutative property are called commutative groups or abelian groups. A group that has a finite number of elements is defined as a finite group. The number of elements in a finite group $G$ is called the order of $G$, denoted by $|G|$. Examples of finite abelian groups are shown in Figure 2.1. Although these groups are represented by two different notations, it is essentially the same group shown twice. In general, any two algebraic systems having the same structure but different notations are called isomorphic.

     +  |  0   1   2   3   4            *  |  e   g1  g2  g3  g4
    ----+--------------------          ----+--------------------
     0  |  0   1   2   3   4            e  |  e   g1  g2  g3  g4
     1  |  1   2   3   4   0            g1 |  g1  g2  g3  g4  e
     2  |  2   3   4   0   1            g2 |  g2  g3  g4  e   g1
     3  |  3   4   0   1   2            g3 |  g3  g4  e   g1  g2
     4  |  4   0   1   2   3            g4 |  g4  e   g1  g2  g3

Figure 2.1 Example of Abelian Finite Groups

A group  is  defined  as  cyclic  when  all  elements  of  the  group  can  be  generated  from  one 
element  of  the  group  in  the  form  a , a*a,  a*a*a, ... , where  * denotes  the  binary  operation  in 
the  group.  All  cyclic  groups  of  the  same  order  are  isomorphic  to  one  another. 
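
As a small illustration of this definition, the following Python sketch generates a cyclic group by repeatedly applying the binary operation to a single element. The helper name `generate` and the choice of addition modulo 5 (the left-hand group of Figure 2.1) are illustrative assumptions, not part of the original text.

```python
def generate(a, op):
    """List a, a*a, a*a*a, ... under the binary operation `op` until it repeats."""
    elements = []
    x = a
    while x not in elements:
        elements.append(x)
        x = op(x, a)
    return elements


def add_mod5(s, t):
    """Binary operation of the additive group of integers modulo 5."""
    return (s + t) % 5


# The element 1 generates every element of the group, so the group is cyclic.
generated = generate(1, add_mod5)
is_cyclic = sorted(generated) == [0, 1, 2, 3, 4]
```

The identity element 0, by contrast, generates only itself, so not every element of a cyclic group is a generator.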




Rings. A ring $R$ is an abelian group equipped with a second operation. Formally, a ring $(R, +, \cdot)$ is a set $R$ with two binary operations such that $R$ is an abelian group with respect to '+', the operation '·' is associative, and the distributive laws hold (APPENDIX A D2.1). A commutative ring is one in which the operation '·' is commutative, and a ring with identity is one that has an identity under the operation '·'. In its least restrictive sense, a ring is a set in which one can 'add,' 'subtract,' and 'multiply.'

Fields. A more powerful algebraic structure, known as a field, is a set in which one can 'add,' 'subtract,' 'multiply,' and 'divide.' A commutative ring $R$ with identity is called a field if and only if the nonzero elements in $R$ form a group under multiplication. A field with a finite number of elements $q$ is called a finite field or a Galois field (after its discoverer) and is denoted by GF($q$). In the sequel, $p$ denotes an odd prime and $q = p^m$ denotes a prime power. $F_q$ or GF($q$) describes a finite field of $q$ elements; $F_q^*$ describes its group of nonzero elements; and $F_q[x]$ describes the polynomial ring in one indeterminate. The following sections examine the finite computational structures which contain the ring of integers $Z_M$ with respect to modulus $M$, the field of integers GF($p$) with respect to prime modulus $p$, and the Galois field GF($q$) of $p^m$ elements.


2.3  Modular  Arithmetic  and  Number  Theory 


This section discusses several basic concepts of modular arithmetic pertaining to the application of number theory to finite computational structures. Number theory for signal processing begins with the elementary idea of division.




Modular arithmetic. Let $a$ and $M$ be any two integers, with $M$ positive. Then $a$ can be divided by $M$ to get a quotient $k$ and a remainder $b$ which satisfy both the equation

$a = kM + b$  (2.1)

and the inequality

$0 \le b < M$.  (2.2)

The integers $a$ and $b$ are said to be congruent mod $M$ if $a - b = kM$, where $k$ is some integer and $M$ is the modulus. This is written as

$a \equiv b \bmod M$.  (2.3)

Equation (2.3) also serves as the necessary and sufficient condition for the integers $a$ and $b$ to belong to the same residue class mod $M$. All integers are congruent mod $M$ to some integer in the set $\{0, 1, \ldots, M-1\}$, which is a set of $M$ integers representing all the residue classes mod $M$. The set is called a residue system mod $M$ or the ring of integers mod $M$, and is denoted by $Z_M$. If within a ring of integers the multiplicative inverses $a^{-1}$ exist for all nonzero integers $a$, that is $a a^{-1} \equiv 1 \bmod M$, then this ring becomes a field. Note that $Z_M$ is a field if and only if $M$ is a prime. Because of the nature of modular arithmetic, numbers have neither sizes nor magnitudes. The following basic operations are permissible with modular arithmetic: addition (e.g., $15 + 5 = 20 \equiv 3 \bmod 17$); negation (e.g., $-5 \equiv 17 - 5 = 12 \bmod 17$); subtraction (e.g., $15 - 5 \equiv 15 + (-5) \equiv 15 + 12 \equiv 10 \bmod 17$); multiplication (e.g., $15 \times 5 = 75 \equiv 7 \bmod 17$); inverse (e.g., $7 \equiv 5^{-1} \bmod 17$, because $7 \times 5 \equiv 1 \bmod 17$); and division, where $a/b$ exists if and only if $b$ has an inverse, in which case $a/b = a b^{-1}$ (e.g., $15/5 = 15 \times 5^{-1} \equiv 15 \times 7 \equiv 3 \bmod 17$).
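
The worked examples above can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative, and the inverse relies on the three-argument `pow` with exponent -1, which assumes Python 3.8 or later.

```python
M = 17  # prime modulus used in the examples above

addition = (15 + 5) % M           # 20 reduces to 3
negation = (-5) % M               # -5 is congruent to 12
subtraction = (15 - 5) % M        # same as (15 + 12) % 17 = 10
multiplication = (15 * 5) % M     # 75 reduces to 7
inverse_of_5 = pow(5, -1, M)      # 7, because 7 * 5 = 35 = 2 * 17 + 1
division = (15 * inverse_of_5) % M  # 15 / 5 computed as 15 * 5^(-1) = 3
```

Division is the only operation that can fail: `pow(b, -1, M)` raises `ValueError` when `b` has no inverse modulo `M`, which cannot happen for nonzero `b` when `M` is prime.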

Basic number theory. For a transform having DFT structure, it is necessary to introduce an integer $a$ which represents the $N$th root of unity (i.e., $a^N \equiv 1$). The classic study of the problem is developed using Euler's phi-function $\varphi$, which is considered as a function of the integer variable $M$, denoted as $\varphi(M)$. This function represents the number of integers in $Z_M$ that are relatively prime to $M$. For $M$ a prime, $\varphi(M) = M - 1$. Furthermore, if $M$ is a power of a prime, $p^m$, then the only elements of $Z_M$ which are not relatively prime to $p^m$ are the $p^{m-1}$ multiples of $p$; therefore $\varphi(p^m) = p^m - p^{m-1} = p^{m-1}(p - 1)$.

For the general case in which $M$ is a composite number, the fundamental theorem of arithmetic states that $M$ has the following unique prime power factorization

$M = p_1^{e_1} \cdot p_2^{e_2} \cdots p_n^{e_n}$,  (2.4)

then the general expression for $\varphi(M)$ becomes

$\varphi(M) = M \cdot (1 - 1/p_1) \cdot (1 - 1/p_2) \cdots (1 - 1/p_n)$.  (2.5)

If $M$ is a product of distinct primes only, which means $e_i = 1$ for $i = 1, 2, \ldots, n$, then

$\varphi(M) = (p_1 - 1)(p_2 - 1) \cdots (p_n - 1)$.  (2.6)

Euler's theorem states that for every $a$ relatively prime to $M$

$a^{\varphi(M)} \equiv 1 \bmod M$.  (2.7)

For $M$ prime, this reduces to Fermat's theorem

$a^{M-1} \equiv 1 \bmod M$,  (2.8)
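
Equations (2.5) and (2.7) can be checked numerically. The sketch below is illustrative only: the function names `prime_factors` and `phi` are assumptions, and trial division is used for simplicity rather than efficiency.

```python
from math import gcd


def prime_factors(m):
    """Return the set of distinct primes dividing m (trial division)."""
    factors = set()
    p = 2
    while p * p <= m:
        while m % p == 0:
            factors.add(p)
            m //= p
        p += 1
    if m > 1:
        factors.add(m)
    return factors


def phi(m):
    """Euler's phi-function via Equation (2.5): m * prod(1 - 1/p)."""
    result = m
    for p in prime_factors(m):
        result = result // p * (p - 1)
    return result


# Euler's theorem, Equation (2.7), checked for every a relatively prime to 15.
M = 15
euler_holds = all(pow(a, phi(M), M) == 1 for a in range(1, M) if gcd(a, M) == 1)
```

For the two moduli used in the examples of this section, `phi(15)` evaluates to 8 and `phi(17)` to 16.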


which holds for all nonzero elements of $Z_M$. In the field $Z_M$ there are certain roots of unity that are of particular interest. If $N$ is the least positive integer such that

$a^N \equiv 1 \bmod M$,  (2.9)

then $a$ is said to be a root of unity of order $N$ (or simply of order $N$) and is denoted as $O_M(a) = N$. Using another terminology, the root $a$ is said to be a primitive $N$th root of unity. For the case of $O_M(a) = \varphi(M)$, $a$ is called a primitive root mod $M$. If $M$ is prime and $a$ is a primitive root, then all nonzero integers in $Z_M$ can be generated by powers of $a$. This characterizes the entire field.

Euler's theorem implies that if $a$ is of order $N$ then $N$ must divide $\varphi(M)$, denoted by $N \mid \varphi(M)$. This implies that all the possible orders of roots are divisors of $\varphi(M)$; however, the theorem does not state that for every divisor of $\varphi(M)$ there are roots corresponding to it. Nevertheless, the phi-function satisfies the property

$\varphi(M) = \varphi(N_1) + \varphi(N_2) + \cdots + \varphi(N_s)$,  (2.10)

where the $N_i$ are the divisors of $\varphi(M)$, including 1 and $\varphi(M)$. The following theorem states the relationship of order between roots.

Theorem 2.1 If the order of $a$ is $N$, then the order of $a^k$ is $N / \gcd(N, k)$.

Proof: For integers $N$ and $k$, let $N = mn$ and $k = ml$, where $n$ and $l$ are relatively prime. Then $\gcd(N, k) = m$ and $\mathrm{lcm}(N, k) = nlm$. Thus,

$Nk = m^2 nl = \gcd(N, k) \, \mathrm{lcm}(N, k)$.  (2.11)

Assume $e$ is the order of $a^k$, that is $(a^k)^e \equiv 1$, which indicates that $e$ is the smallest positive integer required to make $ke$ a multiple of $N$. Thus, $ke = \mathrm{lcm}(N, k)$. It follows from Equation (2.11) that $ke = Nk / \gcd(N, k)$, thus $e = N / \gcd(N, k)$. ♦
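
Theorem 2.1 is easy to confirm by brute force for a small modulus. The sketch below assumes the modulus 17 and the root 3 purely for illustration; the helper `order` is a hypothetical name for a direct search for the least exponent in Equation (2.9).

```python
from math import gcd


def order(a, M):
    """Least positive N with a^N = 1 mod M (assumes gcd(a, M) == 1)."""
    x, n = a % M, 1
    while x != 1:
        x = (x * a) % M
        n += 1
    return n


M, a = 17, 3
N = order(a, M)  # 3 has order 16 modulo 17

# Compare the order of a^k with N / gcd(N, k) for every k = 1, ..., N.
theorem_holds = all(
    order(pow(a, k, M), M) == N // gcd(N, k) for k in range(1, N + 1)
)
```

The same check can be repeated with any other root by changing `a`; the search in `order` terminates only when `a` is relatively prime to `M`.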

According to the theorem, if $\gcd(N, k) = 1$ then $O_M(a^k) = N$. This implies that the number of roots of order $N$ is given by $\varphi(N)$. Furthermore, the number of primitive roots modulo $M$ is given by $\varphi(\varphi(M))$. Overall, Theorem 2.1 allows one to calculate all the roots of possible orders from one of the known primitive roots. To calculate the multiplicative inverse of a nonzero integer $a$ when $M$ is a prime, one finds that $a^{-1} \equiv a^{M-2}$ because $a \cdot a^{M-2} = a^{M-1} \equiv 1$. In general, every integer $a$ with $\gcd(a, M) = 1$ has a multiplicative inverse of the form $a^{-1} \equiv a^{\varphi(M)-1}$. When $M$ is composite, not all elements have inverses. The elements which are relatively prime to $M$ form a multiplicative group which is a subset of the ring $Z_M$. The idea presented above is illustrated in Example 2.1.




Reordering powers of a primitive root. This property states that if the first $\varphi(N)$ powers of a primitive root of $N$ are computed modulo $N$, then all numbers relatively prime to $N$ and less than $N$ are generated. Stated mathematically, let $a$ be a primitive root of $N$. Then $a^1, a^2, \ldots, a^{\varphi(N)}$ are congruent modulo $N$, in some order, to $a_1, a_2, \ldots, a_{\varphi(N)}$, where $\gcd(a_i, N) = 1$ and $a_i < N$.

Example 2.1: The ring with arithmetic modulo 15.

Consider $Z_{15}$ (arithmetic modulo 15), where $M = 15$ is a composite number. Then $\varphi(15) = (5 - 1)(3 - 1) = 8$ and $a^8 \equiv 1 \bmod 15$. According to Equation (2.3), the set of the roots of unity is $R = \{ a \mid (a, 15) = 1 \} = \{1, 2, 4, 7, 8, 11, 13, 14\}$. As shown in Table 2.1, $R$ forms a multiplicative group of order 8 which is a subset of the ring $Z_{15}$.

Table 2.1 The Elements and Their Order of the Multiplicative Group of the Ring $Z_{15}$

     a^1   a^2   a^3   a^4   N = O_M(a)   a^-1
       1     -     -     -       1          1
       2     4     8     1       4          8
       4     1     -     -       2          4
       7     4    13     1       4         13
       8     4     2     1       4          2
      11     1     -     -       2         11
      13     4     7     1       4          7
      14     1     -     -       2         14

Table 2.2 shows that there is no primitive root modulo 15, because there is no root having the order of $\varphi(15)$. However, it is important to note that 2, 7, 8, and 13 are primitive 4th roots of unity. ♦
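
The orders listed for Example 2.1 can be recomputed directly. The following sketch is illustrative only; the names `units`, `orders`, and `histogram` are assumptions introduced here.

```python
from collections import Counter
from math import gcd


def order(a, M):
    """Least positive N with a^N = 1 mod M (assumes gcd(a, M) == 1)."""
    x, n = a % M, 1
    while x != 1:
        x = (x * a) % M
        n += 1
    return n


M = 15
units = [a for a in range(1, M) if gcd(a, M) == 1]   # the set R of Example 2.1
orders = {a: order(a, M) for a in units}              # element -> order
histogram = Counter(orders.values())                  # order -> number of roots

# A primitive root would need order phi(15) = 8; none exists modulo 15.
has_primitive_root = 8 in histogram
```

The histogram counts one root of order 1, three of order 2, and four of order 4, and it contains no entry for order 8.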

Table 2.2 The Possible and Actual Order of the Elements in the Multiplicative Group of the Ring $Z_{15}$

     Possible order N     Number of roots of      Actual number of
     (N | phi(M))         order N (= phi(N))      roots of order N
           1                      1                      1
           2                      1                      3
           4                      2                      4
           8                      4                      0

Note: $\varphi(15) = 8 = \varphi(8) + \varphi(4) + \varphi(2) + \varphi(1)$.

Example 2.2: The field with arithmetic modulo 17.

Consider $Z_{17}$ (arithmetic modulo 17), where $M = 17$ is prime. According to Equation (2.9), every nonzero integer $a$ in $Z_{17}$ has a multiplicative inverse. This means that the nonzero elements in $Z_{17}$ form a group under multiplication. Furthermore, $Z_{17}$ is a commutative ring with unity, so that $Z_{17}$ is a finite field of order 17 ($\varphi(17) + 1 = 16 + 1$). By applying the ideas in the previous discussion and referring to Table 2.3, the following observations can be made:

(1)  $\varphi(17) = 17 - 1 = 16$.

(2)  $a^{16} \equiv 1 \bmod 17$, $\gcd(a, 17) = 1$.

(3)  The number of the primitive roots is $\varphi(\varphi(17)) = 8$.

(4)  The primitive roots are 3, 5, 6, 7, 10, 11, 12, and 14; they are also called the primitive 16th roots of unity.

(5)  The roots of order $N$ are called primitive $N$th roots of unity.

(6)  The possible orders $N$ of roots, such that $N \mid \varphi(17)$, are 16, 8, 4, 2, 1.

(7)  The multiplicative group of nonzero elements in $Z_{17}$ can be generated by any one of the primitive roots.

(8)  For primitive roots $a$, $(a^8)^2 \equiv 1 \bmod 17$. Thus, $a^8 \equiv -1 \bmod 17$, which becomes $a^8 \equiv 16 \bmod 17$. These primitive roots are also roots of the equation $x^8 + 1 \equiv 0 \bmod 17$.




Table 2.3 The Elements and Their Order of the Field $Z_{17}$

     a^1  a^2  a^3  a^4  a^5  a^6  a^7  a^8  a^9  a^10 a^11 a^12 a^13 a^14 a^15 a^16    N   a^-1
       1    -    -    -    -    -    -    -    -    -    -    -    -    -    -    -     1     1
       2    4    8   16   15   13    9    1    -    -    -    -    -    -    -    -     8     9
       3    9   10   13    5   15   11   16   14    8    7    4   12    2    6    1    16     6
       4   16   13    1    -    -    -    -    -    -    -    -    -    -    -    -     4    13
       5    8    6   13   14    2   10   16   12    9   11    4    3   15    7    1    16     7
       6    2   12    4    7    8   14   16   11   15    5   13   10    9    3    1    16     3
       7   15    3    4   11    9   12   16   10    2   14   13    6    8    5    1    16     5
       8   13    2   16    9    4   15    1    -    -    -    -    -    -    -    -     8    15
       9   13   15   16    8    4    2    1    -    -    -    -    -    -    -    -     8     2
      10   15   14    4    6    9    5   16    7    2    3   13   11    8   12    1    16    12
      11    2    5    4   10    8    3   16    6   15   12   13    7    9   14    1    16    14
      12    8   11   13    3    2    7   16    5    9    6    4   14   15   10    1    16    10
      13   16    4    1    -    -    -    -    -    -    -    -    -    -    -    -     4     4
      14    9    7   13   12   15    6   16    3    8   10    4    5    2   11    1    16    11
      15    4    9   16    2   13    8    1    -    -    -    -    -    -    -    -     8     8
      16    1    -    -    -    -    -    -    -    -    -    -    -    -    -    -     2    16

(9)  For roots of order 4, which are 4 and 13, $(a^2)^2 \equiv 1 \bmod 17$; therefore $a^2 \equiv -1 \bmod 17$ and thus $a^2 \equiv 16 \bmod 17$. The root $a$ is also a root of the equation $x^2 + 1 \equiv 0 \bmod 17$. By definition, if the congruence $x^2 \equiv c \bmod M$, where $(c, M) = 1$, is solvable, then $c$ is a quadratic residue modulo $M$. If this congruence has no solution, $c$ is said to be a quadratic nonresidue modulo $M$. For instance, $-1$ is a quadratic residue modulo 17, and 2 is also a quadratic residue modulo 17, since $11^2 \equiv 2 \bmod 17$.

(10)  To find all the primitive roots, one simply has to find all $k$ with $(k, 16) = 1$. By raising a known primitive root to the power $k$, all primitive roots can be found. In this example, 3 is a primitive root modulo 17 and $k \in \{1, 3, 5, 7, 9, 11, 13, 15\}$. The set $A$ of the primitive roots is $A = \{3^1, 3^3, \ldots, 3^{15}\} = \{3, 10, 5, \ldots, 6\}$.

(11)  Given that $M = 17 = 2^b + 1$ with $b = 4$, it follows that $2^b \equiv -1 \bmod 17$. This makes $(2^b)^2 \equiv 1 \bmod 17$, such that the integer 2 has order $2b$. ♦
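
Several of the observations in Example 2.2 can be verified with a short script. This is a minimal sketch under the assumption that brute-force search is acceptable for a modulus as small as 17; all variable names are illustrative.

```python
from math import gcd


def order(a, M):
    """Least positive N with a^N = 1 mod M (assumes gcd(a, M) == 1)."""
    x, n = a % M, 1
    while x != 1:
        x = (x * a) % M
        n += 1
    return n


M = 17

# Observation (4): elements whose order equals phi(17) = 16.
primitive_roots = [a for a in range(1, M) if order(a, M) == M - 1]

# Observation (10): powers 3^k with gcd(k, 16) = 1 give every primitive root.
exponents = [k for k in range(1, M) if gcd(k, M - 1) == 1]
powers_of_three = [pow(3, k, M) for k in exponents]

# Observation (9): quadratic residues are the squares of the nonzero elements.
quadratic_residues = sorted({(x * x) % M for x in range(1, M)})

# Observation (11): 2 has order 2b = 8 because 17 = 2^4 + 1.
order_of_two = order(2, M)
```

Sorting `powers_of_three` reproduces `primitive_roots`, and both 16 (that is, -1) and 2 appear among the quadratic residues.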

2.3.1  Chinese  Remainder  Theorem  (CRT)  for  Integers 

A special case of this theorem is credited to the Chinese mathematician Sun-Tsu, who wrote sometime between 200 B.C. and 200 A.D. A general proof was given by Chiu-Shao, Nicomachus (Greek), and Euler. When considering a nonprime $M$, $Z_M$ is a ring and the inverses exist only for integers relatively prime to $M$. Let $M$ have the prime power factorization found in Equation (2.4). When the arithmetic is performed modulo $M$, it is in effect performed modulo each prime power $p_i^{e_i}$ simultaneously. A set of arithmetic operations can be performed modulo each $p_i^{e_i}$ separately, with the final result mod $M$ obtained using the Chinese remainder theorem (CRT). The general CRT for integers states as follows: Given distinct primes $p_1, p_2, \ldots, p_n$ and integers $c_1, c_2, \ldots, c_n$, the simultaneous congruences $C \equiv c_i \bmod p_i^{e_i}$ have a unique solution mod $M$, and

$C = \left( \sum_{i=1}^{n} c_i N_i M_i \right) \bmod M$,  (2.12)

where $M_i = M / p_i^{e_i}$, with $\gcd(M_i, p_i^{e_i}) = 1$, and $N_i = (M_i)^{-1} \bmod p_i^{e_i}$.
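
Equation (2.12) translates almost directly into code. The sketch below is an illustration rather than a definitive implementation: the function name `crt` is an assumption, the moduli are assumed to be pairwise relatively prime, and `math.prod` and `pow(x, -1, m)` assume Python 3.8 or later.

```python
from math import prod


def crt(residues, moduli):
    """Reconstruct C mod M from residues c_i modulo pairwise coprime moduli."""
    M = prod(moduli)
    total = 0
    for c_i, m_i in zip(residues, moduli):
        M_i = M // m_i               # M_i = M / p_i^e_i
        N_i = pow(M_i, -1, m_i)      # N_i = (M_i)^(-1) mod p_i^e_i
        total += c_i * N_i * M_i
    return total % M


# Classic instance: C = 2 mod 3, C = 3 mod 5, C = 2 mod 7.
C = crt((2, 3, 2), (3, 5, 7))
```

With the moduli 3, 5, and 7 the product is $M = 105$, and the three residues above determine the single value 23 in the range 0 to 104.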

2.3.2  Residue Number System (RNS) Arithmetic

Let $a$ and $b$ be determined by the CRT from the sequences of integers $\{a_i\}$ and $\{b_i\}$, $i = 1, 2, \ldots, L$, respectively (i.e., $a_i = a \bmod N_i$ and $b_i = b \bmod N_i$). Let $\otimes$ denote the operation of either '·' or '+'. Then it is easy to show that

$(a \otimes b) \bmod N = \{ (a_1 \otimes b_1) \bmod N_1, \ldots, (a_L \otimes b_L) \bmod N_L \}$,

where $N = N_1 N_2 \cdots N_L$. Thus multiplication, addition, and subtraction involving $a$ and $b$ can be accomplished solely by operations on the residue digits $a_i$ and $b_i$. Such arithmetic is called residue number system (RNS) arithmetic. High-speed digital systems can be mechanized by parallel processors operating on the residue digits. As in any digital system, there are overflow constraints.
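
The digit-wise nature of RNS arithmetic, and the overflow constraint, can be seen in a small sketch. The moduli (3, 5, 7) and the helper names `to_rns`, `rns_op`, and `from_rns` are assumptions chosen for illustration; reconstruction uses the integer CRT of Equation (2.12).

```python
from math import prod

MODULI = (3, 5, 7)     # pairwise relatively prime, illustrative choice
N = prod(MODULI)       # dynamic range of the system: 105


def to_rns(x):
    """Residue digits of x, one per modulus."""
    return tuple(x % n_i for n_i in MODULI)


def rns_op(u, v, op):
    """Apply `op` independently to each pair of residue digits."""
    return tuple(op(u_i, v_i) % n_i for u_i, v_i, n_i in zip(u, v, MODULI))


def from_rns(digits):
    """Reconstruct the integer in [0, N) from its residue digits via the CRT."""
    total = 0
    for d_i, n_i in zip(digits, MODULI):
        m_i = N // n_i
        total += d_i * pow(m_i, -1, n_i) * m_i
    return total % N


a, b = 8, 9
sum_ab = from_rns(rns_op(to_rns(a), to_rns(b), lambda s, t: s + t))
prod_ab = from_rns(rns_op(to_rns(a), to_rns(b), lambda s, t: s * t))

# Overflow: 12 * 11 = 132 exceeds the dynamic range and wraps modulo 105.
wrapped = from_rns(rns_op(to_rns(12), to_rns(11), lambda s, t: s * t))
```

Each digit position is processed without carries to or from the others, which is what allows the parallel processors mentioned above; the price is that results are only meaningful modulo $N$.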

Modulo techniques for binary systems. Closely related to the concept of congruence is the idea of extracting a residue. Computations involving residues are usually simple because one never needs to work with quantities larger than the modulus. Two procedures which simplify some of the operations are (1) $(x \pm y) \bmod M = [(x \bmod M) \pm (y \bmod M)] \bmod M$; and (2) $(x \cdot y) \bmod M = [(x \bmod M) \cdot (y \bmod M)] \bmod M$.

One method of computing '$a \bmod M$' is simply to divide $a$ by $M$ and keep the remainder. If $M$ is a power of two, and $a$ is represented on a binary machine, then a trivial method of extracting $a \bmod M$ exists:

$a \bmod 2^k = \left( \sum_{i} a_i 2^i \right) \bmod 2^k = \sum_{i=0}^{k-1} a_i 2^i$.  (2.13)

This operation is performed by masking out all but the $k$ least significant bits. Another simple case occurs when the modulus is of the form $2^k - 1$. Let the first $k$ bits of $a$, $a_0, a_1, \ldots, a_{k-1}$, be the binary number $A$; let the next $k$ bits be the binary number
$B = a_k + 2 a_{k+1} + 2^2 a_{k+2} + \cdots + 2^{k-1} a_{2k-1}$,

and so on, then

$a = A + 2^k B + 2^{2k} C + \cdots$.  (2.14)

Since $2^k \bmod (2^k - 1) = 1$, and $2^{nk} \bmod (2^k - 1) = 1$, thus

$a \bmod (2^k - 1) \equiv (A + B + C + \cdots) \bmod (2^k - 1)$.  (2.15)

Consequently, the residue of a number with very many bits may be found by adding $k$-bit subwords.


-19- 


Another case in which a residue is easily extracted without division is when the modulus is of the form $2^k + 1$. Let $a$ be as in (2.14); then since $2^k \equiv -1 \bmod (2^k + 1)$, and $2^{nk} \equiv (-1)^n \bmod (2^k + 1)$, then

$a \bmod (2^k + 1) \equiv (A - B + C - \cdots) \bmod (2^k + 1)$.

Thus we have seen that for moduli of the form $2^k$ and $2^k \pm 1$, methods much simpler than division exist for extracting a residue.
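
The three division-free reductions can be sketched with shifts and masks. The function names are illustrative assumptions, the inputs are assumed to be nonnegative integers, and the final `%` in the $2^k + 1$ case merely stands in for the small correction step that a hardware design would perform with conditional additions or subtractions.

```python
def mod_pow2(a, k):
    """a mod 2^k: keep only the k least significant bits, Equation (2.13)."""
    return a & ((1 << k) - 1)


def mod_pow2_minus1(a, k):
    """a mod (2^k - 1): repeatedly add the k-bit subwords, Equation (2.15)."""
    m = (1 << k) - 1
    while a > m:
        total = 0
        while a:
            total += a & m   # next k-bit subword A, B, C, ...
            a >>= k
        a = total
    return 0 if a == m else a


def mod_pow2_plus1(a, k):
    """a mod (2^k + 1): alternating sum A - B + C - ... of the k-bit subwords."""
    m = (1 << k) + 1
    mask = (1 << k) - 1
    total, sign = 0, 1
    while a:
        total += sign * (a & mask)
        sign = -sign
        a >>= k
    # The alternating sum is small in magnitude; reduce it into [0, 2^k].
    return total % m


value = 0b1011_0110_1101      # 2925, split into 4-bit subwords 13, 6, 11
r16 = mod_pow2(value, 4)
r15 = mod_pow2_minus1(value, 4)
r17 = mod_pow2_plus1(value, 4)
```

For the sample value, the subword sum is 13 + 6 + 11 = 30 and the alternating sum is 13 - 6 + 11 = 18, which reduce to the same residues as direct division by 15 and 17.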

2.4  Residue  Polynomials 

Polynomials with coefficients in a field (ring) are referred to as polynomials over a field (ring) and are important in the development of efficient cyclic convolution evaluation. This section discusses properties of polynomials. Many properties are analogous to the properties of integers discussed in the previous section. For example, the CRT for polynomials is similar to that for integers and results in an expansion that reduces multiplications in FFT implementations.

2.4.1  Properties  of  Polynomial 

The theory of residue polynomials is closely related to the theory of integer residue classes. For each field $F$, there is a ring $F[x]$ called the ring of polynomials over $F$. Within a ring of polynomials, subtraction is always possible, but division is not. We write $f(x) \mid g(x)$ and say that the polynomial $g(x)$ is divisible by $f(x)$, provided there is a polynomial $q(x)$ such that $g(x) = q(x) \cdot f(x)$. A nonzero polynomial $f(x)$ that is divisible only by $\alpha f(x)$ or $\alpha$, where $\alpha$ is an arbitrary nonzero field element, is called an irreducible polynomial. In the case that $f(x)$ is not a divisor of $g(x)$, the division of $g(x)$ by $f(x)$ will produce a quotient and a residue polynomial, such that

$g(x) = q(x) \cdot f(x) + r(x)$,  (2.16)

where the degree of $r(x)$ is less than the degree of $f(x)$. Usually, the residue polynomial is of more interest than the quotient polynomial. The residue polynomial in congruence form is written as

$r(x) \equiv g(x) \bmod f(x)$.  (2.17)

The process of obtaining the remainder $r(x)$ from $g(x)$ and $f(x)$ is called polynomial residue reduction. In general, all polynomials having the same residue when divided by $f(x)$ are said to be congruent modulo $f(x)$, and the relation is denoted by

$g(x) \equiv r(x) \bmod f(x)$.  (2.18)

Two polynomials which differ only by a multiplicative constant are congruent. Thus, residue polynomials deal with the relative values of coefficients rather than with their absolute values. At this point, it is worth noting that when dealing with polynomials, primary interest lies in the coefficients of the polynomial. Consequently, a set of $N$ elements $a_0, a_1, \ldots, a_{N-1}$ is arranged in the form of a polynomial

$h(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_{N-1} x^{N-1}$  (2.19)

with $x$ denoting position. This feature is very important in digital signal processing because each polynomial coefficient represents a sample of an analog signal stream which defines its location and intensity.

Equation (2.18) defines equivalence classes of polynomials modulo a polynomial $f(x)$. It is easily verified that the set of polynomials defined with addition and multiplication modulo $f(x)$ is a ring and reduces to a field when $f(x)$ is irreducible. When $f(x)$ is not irreducible, it can always be factored uniquely into powers of irreducible polynomials. Note, however, that the factorization depends on the field of its coefficients: $x^2 + 1$ is irreducible for coefficients in the field of rational numbers, but it is reducible in the finite field $Z_{17}$, where $x^2 + 1 = (x - 4)(x - 13)$, as well as in the field of complex numbers, where $x^2 + 1 = (x + i)(x - i)$, $i = \sqrt{-1}$.
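
The dependence of factorization on the coefficient field can be checked for $x^2 + 1$ over $Z_{17}$. In this sketch, polynomials are stored as coefficient lists with the lowest degree first; that convention and the helper names `poly_mul_mod` and `roots_mod` are assumptions made for illustration.

```python
def poly_mul_mod(f, g, m):
    """Multiply two polynomials and reduce every coefficient modulo m."""
    out = [0] * (len(f) + len(g) - 1)
    for i, f_i in enumerate(f):
        for j, g_j in enumerate(g):
            out[i + j] = (out[i + j] + f_i * g_j) % m
    return out


def roots_mod(f, m):
    """All x in {0, ..., m-1} with f(x) = 0 mod m, found by exhaustive trial."""
    return [
        x for x in range(m)
        if sum(c * pow(x, i, m) for i, c in enumerate(f)) % m == 0
    ]


M = 17
f = [1, 0, 1]                                   # x^2 + 1
roots = roots_mod(f, M)                         # roots of x^2 + 1 in Z_17
product = poly_mul_mod([-4, 1], [-13, 1], M)    # (x - 4)(x - 13) mod 17
```

The product expands to $x^2 - 17x + 52$, whose coefficients reduce modulo 17 to those of $x^2 + 1$, and the two roots found are exactly the elements of order 4 noted in Example 2.2.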




The CRT for polynomials. Now suppose that $f(x)$ is the product of $n$ polynomials $f_i(x)$ having no common factors; these polynomials are usually called relatively prime polynomials.

$f(x) = \prod_{i=1}^{n} f_i(x)^{e_i}$.  (2.20)

Since each of these polynomials $f_i(x)$ is relatively prime with all other polynomials, it has an inverse modulo every other polynomial. This means that the CRT can be extended to the ring of polynomials modulo $f(x)$, which allows the unique expression of $g(x)$ as a function of polynomials $g_i(x)$ obtained by reducing $g(x)$ modulo the polynomials $f_i(x)^{e_i}$. The CRT for polynomials is then expressed as

$g(x) = \left( \sum_{i=1}^{n} g_i(x) N_i(x) M_i(x) \right) \bmod f(x)$,

where $M_i(x) = f(x) / f_i(x)^{e_i}$, with $\gcd(M_i(x), f_i(x)^{e_i}) = 1$, and $N_i(x) = M_i^{-1}(x) \bmod f_i(x)^{e_i}$. The most difficult part of the calculation of $N_i(x) M_i(x)$ relates to the evaluation of the inverse $N_i(x)$, which can be accomplished using Euclid's algorithm.
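
A small worked instance may help fix the notation. The choice $f(x) = x^2 - 1$ over the rationals and the sample polynomial $g(x) = 3x + 5$ are assumptions made purely for illustration; they do not come from the original text.

```latex
% Factors: f(x) = x^2 - 1 = f_1(x) f_2(x), with f_1(x) = x - 1, f_2(x) = x + 1.
\begin{align*}
M_1(x) &= f(x)/f_1(x) = x + 1, & M_1(x) \bmod f_1(x) &= 2,  & N_1(x) &= \tfrac{1}{2},\\
M_2(x) &= f(x)/f_2(x) = x - 1, & M_2(x) \bmod f_2(x) &= -2, & N_2(x) &= -\tfrac{1}{2}.
\end{align*}
% Residues of g(x) = 3x + 5:
\begin{align*}
g_1(x) &= g(x) \bmod (x - 1) = g(1) = 8, \\
g_2(x) &= g(x) \bmod (x + 1) = g(-1) = 2.
\end{align*}
% Reconstruction by the CRT for polynomials:
\begin{align*}
g(x) &= \bigl( g_1 N_1 M_1 + g_2 N_2 M_2 \bigr) \bmod f(x) \\
     &= 8 \cdot \tfrac{1}{2}(x + 1) - 2 \cdot \tfrac{1}{2}(x - 1) \\
     &= 4x + 4 - x + 1 = 3x + 5 .
\end{align*}
```

Because both factors have degree 1, each inverse $N_i(x)$ is a constant here; for higher-degree factors the inverses are polynomials and Euclid's algorithm is needed, as noted above.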

2.4.2  Convolutions and Polynomial RNS

Polynomial algebra plays an important role in digital signal processing because convolutions and discrete Fourier transforms (DFTs) can be expressed in terms of operations on polynomials. This is seen by considering the simple convolution $g_l$ of two sequences $h_n$ and $h'_m$ of $N$ terms

$g_l = \sum_{n=0}^{N-1} h_n h'_{l-n}$,  (2.21)

where $l = 0, 1, \ldots, 2N-2$ and $h'_{l-n}$ is taken as zero when the index $l - n$ falls outside $0, \ldots, N-1$. Suppose that the $N$ elements of $h_n$ and $h'_m$ are assigned to be the coefficients of polynomials $h(x)$ and $h'(x)$ of degree $N-1$ in $x$, where $x$ is the polynomial variable. Analogous to Equation (2.19), if $h(x)$ is multiplied by $h'(x)$ the resulting polynomial $g(x)$ will be of degree $2N-2$. Thus,

$g(x) = h(x) h'(x) = \sum_{l=0}^{2N-2} a_l x^l$.  (2.22)

With the polynomial multiplication, each coefficient $a_l$ of $x^l$ is obtained by summing all products $h_n h'_m$ such that $n + m = l$. It follows from Equation (2.21) and the fact that $m = l - n$ that

$a_l = g_l = \sum_{n=0}^{N-1} h_n h'_{l-n}$.  (2.23)

Thus, Equation (2.22) becomes

$g(x) = \sum_{l=0}^{2N-2} g_l x^l$.  (2.24)

This means that the convolution of two sequences can be treated as the multiplication of two polynomials. Moreover, if the convolution is cyclic, the indices $l$, $m$, and $n$ are defined modulo $N$. Thus in $N$-term cyclic convolutions, $N \equiv 0 \bmod N$, which implies that $x^N = 1$. Therefore a cyclic convolution can be viewed as the product of two polynomials modulo the polynomial $f(x) = x^N - 1$, namely

$g(x) = h(x) h'(x) \bmod f(x)$.  (2.25)

According to the previous discussion of the CRT for polynomials, the polynomial $g(x)$ is computed by first forming the reduced polynomials $h_i(x) = h(x) \bmod f_i(x)$ and $h'_i(x) = h'(x) \bmod f_i(x)$, where the $f_i(x)$ are factors of $x^N - 1$; then, the $n$ polynomial products are evaluated

$g_i(x) = h_i(x) h'_i(x) \bmod f_i(x)$,  (2.26)

with $g(x)$ reconstructed from $g_i(x)$ using the CRT.
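
The correspondence between convolution and polynomial multiplication can be made concrete. In the sketch below, sequences double as coefficient lists with the lowest degree first; the function names and the sample sequences are illustrative assumptions. The cyclic case is obtained by folding the linear result with $x^N = 1$, as in Equation (2.25).

```python
def linear_convolution(h, hp):
    """Coefficients of h(x) * h'(x): g_l is the sum of h_n * h'_m over n + m = l."""
    g = [0] * (len(h) + len(hp) - 1)
    for n, h_n in enumerate(h):
        for m, hp_m in enumerate(hp):
            g[n + m] += h_n * hp_m
    return g


def cyclic_convolution(h, hp):
    """Product h(x) * h'(x) mod (x^N - 1): fold x^(N + r) back onto x^r."""
    N = len(h)
    g = [0] * N
    for l, g_l in enumerate(linear_convolution(h, hp)):
        g[l % N] += g_l
    return g


h = [1, 2, 3]      # h(x)  = 1 + 2x + 3x^2
hp = [4, 5, 6]     # h'(x) = 4 + 5x + 6x^2

linear = linear_convolution(h, hp)    # degree 2N - 2 = 4, five coefficients
cyclic = cyclic_convolution(h, hp)    # reduced modulo x^3 - 1, three coefficients
```

For these inputs the linear result is 4, 13, 28, 27, 18; the cyclic result adds the coefficients of $x^3$ and $x^4$ back onto those of $x^0$ and $x^1$.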




Up to this point, the problem of computing Equation (2.25) can be simplified somewhat by applying the CRT, provided the polynomial $f(x)$ can be factored as

$f(x) = \prod_{i} f_i(x)$.

In accordance with Winograd's theorem [Nus82], a convolution algorithm can be synthesized with $2N - k$ multiplications, where $k$ is the number of irreducible factors of $f(x)$ over the field $F$. If $F$ is the field of complex numbers C, then $x^N - 1$ factors into $N$ polynomials $x - \omega^i$ of degree 1, where $\omega = \exp(-j2\pi/N)$ and $j = \sqrt{-1}$. In this case, the computation technique defined by Winograd is equivalent to the DFT approach and requires only $N$ general multiplications. Unfortunately, the roots $\omega^i$ are irrational and complex, so that the multiplications by scalars corresponding to DFT computation must also be considered as general complex multiplications.

Primitive roots of unity. The element $\omega^m$ is a primitive root of unity if the set $\{ (\omega^m)^0, (\omega^m)^1, \ldots, (\omega^m)^{N-1} \}$ can be reordered as $\{ \omega^0, \omega^1, \ldots, \omega^{N-1} \}$. Drawing the unit circle in the complex plane and showing the points $\omega^0, \ldots, \omega^{N-1}$ verifies that $\omega^m$ is a primitive root if and only if $\gcd(m, N) = 1$.

Factorization of $x^N - 1$. When $F$ is the field of rational numbers Q, the polynomial $x^N - 1$ factors into polynomials having rational number coefficients. These polynomials are called cyclotomic polynomials and are irreducible for coefficients in the field of rational numbers,

$z^N - 1 = \prod_{i \mid N} C_i(z)$,

where $C_i(z)$ is a cyclotomic polynomial of index $i$ and the values of $i$ used for $i \mid N$ include 1 and $N$. For example, $z^2 - 1 = (z - 1)(z + 1) = C_1(z) C_2(z)$ and $z^4 - 1 = (z - 1)(z + 1)(z^2 + 1) = C_1(z) C_2(z) C_4(z)$. Thus, one cyclotomic polynomial of degree $n_i$ exists for each divisor $N_i$ of $N$, and $n_i$ is shown to be $\varphi(N_i)$.

Example 2.3: Polynomial factorization over fields.

Let $f(x) = x^5 - 1$; then $f(x)$ is factored over the complex field C as

$f(x) = \prod_{i=0}^{4} (x - \omega^i)$, with $\omega = \exp(-j2\pi/5)$,

over the real field R as

$f(x) = (x - 1)(x^2 - 2\cos(2\pi/5)\,x + 1)(x^2 - 2\cos(4\pi/5)\,x + 1)$,

and over the rational field Q as

$f(x) = (x - 1)(x^4 + x^3 + x^2 + x + 1)$. ♦
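
The real and rational factorizations of $x^5 - 1$ can be checked numerically by multiplying the factors back together. In this sketch, coefficient lists are ordered lowest degree first, the helper `poly_mul` is a hypothetical name, and each real quadratic is formed by pairing the conjugate roots $\omega^k$ and $\omega^{-k}$, which gives $x^2 - 2\cos(2\pi k/5)\,x + 1$.

```python
from math import cos, pi


def poly_mul(f, g):
    """Multiply two polynomials given as coefficient lists (lowest degree first)."""
    out = [0] * (len(f) + len(g) - 1)
    for i, f_i in enumerate(f):
        for j, g_j in enumerate(g):
            out[i + j] += f_i * g_j
    return out


# Rational factorization: (x - 1)(x^4 + x^3 + x^2 + x + 1).
rational = poly_mul([-1, 1], [1, 1, 1, 1, 1])

# Real factorization: (x - 1) times two quadratics from conjugate root pairs.
q1 = [1, -2 * cos(2 * pi / 5), 1]
q2 = [1, -2 * cos(4 * pi / 5), 1]
real = poly_mul(poly_mul([-1, 1], q1), q2)

target = [-1, 0, 0, 0, 0, 1]    # x^5 - 1
real_matches = all(abs(c - t) < 1e-12 for c, t in zip(real, target))
```

The rational product is exact in integer arithmetic, whereas the real product matches $x^5 - 1$ only up to floating-point rounding, hence the tolerance in the comparison.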

Factorization of $x^N - 1$ over finite fields. In order to obtain the maximum $k$ value, the polynomial factorization over a finite field must be considered. Let $f(x)$ be an integral polynomial (i.e., with integer coefficients) of degree $m$, and let $M$ be a natural number. If $c$ is an integer such that $f(c)$ is divisible by $M$, then $c$ is a solution of the algebraic congruence

    $f(x) \equiv 0 \bmod M$,    (2.28)

and all values $x$ for which $x \equiv c \bmod M$ are also solutions. All the solutions belonging to the same residue class modulo $M$ as $c$ are considered as a single solution. Therefore, to determine all the solutions of the congruence in Equation (2.28), one need only try the values $x = 0, 1, 2, \ldots, M - 1$. It is known [Lip81] that the algebraic congruence of degree $m$ in Equation (2.28), if $M$ is prime, has at most $m$ incongruent solutions modulo $M$. According to Equations (2.8) and (2.9), it follows that the congruence

    $x^{M-1} - 1 \equiv 0 \bmod M$

has roots $x = 1, 2, 3, \ldots, M - 1$. This means $f(x) = x^{M-1} - 1$ can be factored over the finite field $Z_M$ as

    $x^{M-1} - 1 = \prod_{i=0}^{M-2} (x - \alpha^i) \bmod M$,

where $\alpha$ is one of the primitive roots mod $M$.
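The factorization can be confirmed for a small prime. The following sketch assumes $M = 7$ and $\alpha = 3$ (a primitive root mod 7); the helper names are illustrative only. Polynomials are coefficient lists with the lowest degree first.

```python
def poly_mul_mod(a, b, M):
    """Multiply two coefficient lists (lowest degree first), coefficients mod M."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % M
    return out

def product_of_linear_factors(M, alpha):
    """Expand prod_{i=0}^{M-2} (x - alpha^i) over Z_M."""
    poly = [1]
    for i in range(M - 1):
        root = pow(alpha, i, M)
        poly = poly_mul_mod(poly, [(-root) % M, 1], M)
    return poly

M, alpha = 7, 3                       # assumed example values
expanded = product_of_linear_factors(M, alpha)
# x^6 - 1 over Z_7: constant term -1 = 6, leading coefficient 1.
target = [M - 1] + [0] * (M - 2) + [1]
```

Under these assumptions `expanded` coincides with `target`, i.e. with $x^6 - 1$ reduced mod 7.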


2.5  Finite  Fields  Based  on  Polynomial  Rings 

The field of integers modulo a prime number is the most familiar example of a finite field, but many of its properties extend to arbitrary finite fields. The characterization of finite fields shows that every finite field is of prime-power order and that, conversely, for every prime power there exists a finite field whose number of elements is exactly that prime power. Furthermore, finite fields with the same number of elements are isomorphic and may therefore be identified. The parallel between the ring of integers and the ring of polynomials over a field should be apparent. Both are special cases of an algebraic structure called a Euclidean ring.

2.5.1  Construction of the Finite Field GF($p^m$)

Let $f(x)$ be a polynomial of degree $m$ in $F_p[x]$. Then, by the previously discussed Euclidean division algorithm and the technique of polynomial residue reduction, the rules of composition in $F_p[x]$ are

1) Addition

    $a(x) + b(x) = g_1(x) \equiv r_1(x) \bmod f(x)$,

where $g_1(x) = q_1(x) f(x) + r_1(x)$.

2) Multiplication

    $a(x) \cdot b(x) = g_2(x) \equiv r_2(x) \bmod f(x)$,

where $g_2(x) = q_2(x) f(x) + r_2(x)$.
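The two composition rules can be sketched in a few lines. This is a minimal illustration, assuming polynomials are stored as coefficient lists with the lowest degree first and that $p$ is prime; the function names are hypothetical.

```python
def trim(a):
    """Drop leading zero coefficients (lists are lowest degree first)."""
    while len(a) > 1 and a[-1] == 0:
        a = a[:-1]
    return a

def poly_add(a, b, p):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x + y) % p for x, y in zip(a, b)])

def poly_mul(a, b, p):
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % p
    return trim(out)

def poly_mod(g, f, p):
    """Remainder r(x) in g(x) = q(x) f(x) + r(x) over F_p."""
    g = trim([c % p for c in g])
    f = trim([c % p for c in f])
    inv_lead = pow(f[-1], -1, p)          # modular inverse, Python 3.8+
    while len(g) >= len(f) and g != [0]:
        coef = (g[-1] * inv_lead) % p
        shift = len(g) - len(f)
        for i, fc in enumerate(f):
            g[shift + i] = (g[shift + i] - coef * fc) % p
        g = trim(g)
    return g

p = 2
f = [1, 1, 0, 0, 1]                       # f(x) = x^4 + x + 1
a, b = [0, 0, 0, 1], [0, 1]               # a(x) = x^3, b(x) = x
r1 = poly_mod(poly_add(a, b, p), f, p)    # x^3 + x
r2 = poly_mod(poly_mul(a, b, p), f, p)    # x^4 reduces to x + 1
```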




Consider the set $G$ of all polynomials of degree less than $m$. It is known [Mac77] that the set $G$ is of order $p^m$ and forms an abelian group with respect to modulo $f(x)$ addition. Furthermore, if $f(x)$ is irreducible, all nonzero elements of $G$ have multiplicative inverses. Hence, the set $G$ forms a field with respect to modulo $f(x)$ arithmetic. In generating fields using modulo $f(x)$ arithmetic, the field $F_p$ from which the coefficients of the polynomial are chosen is called a subfield of $G$. The field $G$ is called the extension field of degree $m$ over $F_p$ and is denoted by $F_{p^m}$ or GF($p^m$). On the other hand, all the multiples of the polynomial $f(x)$ form an ideal, which is denoted by $(f(x))$. By applying residue reduction to this ideal, a residue class ring known as the ring of polynomials modulo $f(x)$ is formed and is denoted by $F_p[x]/(f(x))$. It is a well established fact that an isomorphism between the extension field $F_{p^m}$ and the residue class ring $F_p[x]/(f(x))$ exists. The extension field $F_{p^m}$ may also be viewed as a vector space over $F_p$. This view results from the following properties.

The elements of $F_{p^m}$ form an abelian group under addition. Moreover, each "vector" $P \in F_{p^m}$ can be multiplied by a "scalar" $b \in F_p$ such that $bP \in F_{p^m}$. The laws for scalar multiplication are also satisfied:

    $b(P_1 + P_2) = bP_1 + bP_2$,
    $(a + b)P_1 = aP_1 + bP_1$, and
    $1P = P$,

where $a, b \in F_p$ and $P_1, P_2, P \in F_{p^m}$. Another way of viewing the concept of extension fields is based on the fact that the irreducible polynomial $f(x)$ over $F_p$ has a root in a field $F_{p^m}$. Hence, it is said that the smallest extension field $F_{p^m}$ is obtained by adjoining a root $\alpha$ of $f(x)$ to the ground field $F_p$ and is denoted by $F_p(\alpha)$. The vector form representation of this field is

    $F_{p^m} = F_p(\alpha) = \{ b_0 + b_1\alpha + b_2\alpha^2 + \ldots + b_{m-1}\alpha^{m-1} : \alpha \in F_{p^m},\ b_i \in F_p \}$.    (2.29)




One of the important properties of the additive structure of a finite field is illustrated as follows. Let $F_r$ be an arbitrary finite field of order $r$. The field contains the unit element 1, and since the field is finite, the elements $1$, $1 + 1 = 2$, $1 + 1 + 1 = 3$, ... cannot all be distinct. Therefore, there exists a smallest number $p$ such that $p = 1 + 1 + \ldots + 1$ ($p$ times) $= 0$. This $p$ must be a prime number (for if $rs = 0$ then $r = 0$ or $s = 0$) and is called the characteristic of the field. Thus for $r = q^n$, $q = p^m$, where $p$ is prime, one concludes that $F_r$ is an extension field of degree $n$ over $F_q$, which is a field of degree $m$ over the prime subfield $F_p$; and all these fields have the same characteristic $p$.

A similar analysis shows that for any nonzero element $\alpha \in F_r$ there is a positive integer $t$ such that $\alpha^t = 1$, the least such $t$ being called the order of $\alpha$. Moreover, the integer $t$ is a divisor of $r - 1$, which is the order of the multiplicative group of $F_r$; and so $\alpha$ satisfies $\alpha^{r-1} = 1$. In the case of $t = r - 1$, such elements are called primitive elements of $F_r$. The most prominent feature of finite fields concerns their cyclic multiplicative groups. The elements of $F_r$ in power form are written as

    $F_r = \{ 0, 1, \alpha, \alpha^2, \ldots, \alpha^{r-2} \}$.    (2.30)

It is convenient to introduce a formal symbol $-\infty$ defined by the equation $\alpha^{-\infty} = 0$, so that the general elements of $F_r$ can be expressed in power form. If the element $\beta = x$ is a primitive element of the field generated by $f(x)$, then $f(x)$ is called a primitive polynomial. The following theorem states the property of a primitive polynomial.

Theorem 2.2: If $f(x)$ is a primitive polynomial, then $s = r - 1$ is the smallest integer such that $f(x)$ divides $x^s - 1$.

Proof. If $f(x)$ divides $x^s - 1$, then $x^s - 1 = f(x)\,g(x)$. That is, $x^s - 1 \equiv 0 \bmod f(x)$, hence $x^s \equiv 1 \bmod f(x)$. Since $f(x)$ is a primitive polynomial, $x$ is a primitive element of the field $F_r$; thus $s = p^m - 1$ is the order of $x$ and also the smallest integer such that $x^s \equiv 1 \bmod f(x)$. ♦




Example 2.4: The representations of the elements in the field $F_{16}$ (or GF($2^4$)).

To represent the elements of $F_{16}$ in this way, regard it as a simple algebraic extension of $F_2$ of degree 4, which is obtained by adjoining a root $\alpha$ of an irreducible polynomial $f(x)$ over $F_2$. Consider two defining irreducible polynomials, one primitive and one non-primitive. Let $f_1(x) = x^4 + x + 1$ and $f_2(x) = x^4 + x^3 + x^2 + x + 1$. Because 15 is the smallest exponent for which $x^{15} \equiv 1 \bmod f_1(x)$, $\alpha = x$ is a primitive element of the finite field $F_2[x]/(f_1(x))$ and the polynomial $f_1(x)$ is a primitive polynomial. From the fact that $x^5 \equiv 1 \bmod f_2(x)$, $\alpha = x$ is an element of order 5 in $F_2[x]/(f_2(x))$. This implies that $f_2(x)$ is not a primitive polynomial. However, $\alpha = x + 1$ can be shown to be a primitive element of $F_2[x]/(f_2(x))$. Accordingly, the isomorphic relationships within finite fields give $F_{16} \cong F_2[x]/(f_1(x)) \cong F_2[x]/(f_2(x))$. The elements of the fields represented in power form and vector form are shown in Table 2.4 and Table 2.5. ♦
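The orders quoted in Example 2.4 can be checked with a short script. The sketch below is illustrative: field elements are stored as 4-bit masks (bit $i$ holds the coefficient of $x^i$), and `gf_mul` and `order` are hypothetical helper names.

```python
def gf_mul(a, b, poly, m=4):
    """Multiply two GF(2^m) elements stored as bitmasks, reducing by `poly`."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> m:            # degree reached m: subtract (XOR) the modulus
            a ^= poly
    return result

def order(elem, poly, m=4):
    """Smallest t >= 1 with elem^t = 1 (elem must be nonzero)."""
    t, acc = 1, elem
    while acc != 1:
        acc = gf_mul(acc, elem, poly, m)
        t += 1
    return t

F1 = 0b10011                  # f1(x) = x^4 + x + 1
F2 = 0b11111                  # f2(x) = x^4 + x^3 + x^2 + x + 1
X, X_PLUS_1 = 0b0010, 0b0011

orders = (order(X, F1), order(X, F2), order(X_PLUS_1, F2))
```

The tuple `orders` holds the order of $x$ modulo $f_1$, of $x$ modulo $f_2$, and of $x + 1$ modulo $f_2$, in that sequence.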


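Table 2.4  Elements $\beta = a_3\alpha^3 + a_2\alpha^2 + a_1\alpha + a_0$ of $F_2(\alpha)$ Generated by $f_1(x)$.

    $\beta$               $\phi(\beta)$ (order)   $a_3$  $a_2$  $a_1$  $a_0$
    $\alpha^{-\infty}$     0                        0      0      0      0
    $\alpha^{0}$           1                        0      0      0      1
    $\alpha^{1}$          15                        0      0      1      0
    $\alpha^{2}$          15                        0      1      0      0
    $\alpha^{3}$           5                        1      0      0      0
    $\alpha^{4}$          15                        0      0      1      1
    $\alpha^{5}$           3                        0      1      1      0
    $\alpha^{6}$           5                        1      1      0      0
    $\alpha^{7}$          15                        1      0      1      1
    $\alpha^{8}$          15                        0      1      0      1
    $\alpha^{9}$           5                        1      0      1      0
    $\alpha^{10}$          3                        0      1      1      1
    $\alpha^{11}$         15                        1      1      1      0
    $\alpha^{12}$          5                        1      1      1      1
    $\alpha^{13}$         15                        1      1      0      1
    $\alpha^{14}$         15                        1      0      0      1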



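Table 2.5  Elements $\beta = a_3x^3 + a_2x^2 + a_1x + a_0$ of $F_2(\alpha)$, $\alpha = x + 1$, Generated by $f_2(x)$.

    $\beta$               $\phi(\beta)$ (order)   $a_3$  $a_2$  $a_1$  $a_0$
    $\alpha^{-\infty}$     0                        0      0      0      0
    $\alpha^{0}$           1                        0      0      0      1
    $\alpha^{1}$          15                        0      0      1      1
    $\alpha^{2}$          15                        0      1      0      1
    $\alpha^{3}$           5                        1      1      1      1
    $\alpha^{4}$          15                        1      1      1      0
    $\alpha^{5}$           3                        1      1      0      1
    $\alpha^{6}$           5                        1      0      0      0
    $\alpha^{7}$          15                        0      1      1      1
    $\alpha^{8}$          15                        1      0      0      1
    $\alpha^{9}$           5                        0      1      0      0
    $\alpha^{10}$          3                        1      1      0      0
    $\alpha^{11}$         15                        1      0      1      1
    $\alpha^{12}$          5                        0      0      1      0
    $\alpha^{13}$         15                        0      1      1      0
    $\alpha^{14}$         15                        1      0      1      0
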

2.5.2  Properties of the Finite Field GF($p^m$)

A useful property of the finite field of prime characteristic is described by Theorem 2.3.

Theorem 2.3: In a field $F$ of characteristic $p$, $(\alpha + \beta)^p = \alpha^p + \beta^p$ for $\alpha, \beta \in F$.

Proof. By the binomial theorem,

    $(\alpha + \beta)^p = \alpha^p + \binom{p}{1}\alpha^{p-1}\beta + \ldots + \binom{p}{p-1}\alpha\beta^{p-1} + \beta^p$.    (2.31)

We claim that for $1 \le k \le p - 1$, the binomial coefficient

    $\binom{p}{k} = \dfrac{p(p - 1)(p - 2) \cdots (p - k + 1)}{k!}$    (2.32)

has $p$ as a factor. First we note that $k!$ divides $p(p - 1) \cdots (p - k + 1)$, since the binomial coefficients are integers. But $\gcd(k!, p) = 1$, since $p$ is prime and $k < p$, so $k!$ divides $(p - 1) \cdots (p - k + 1)$. Consequently, Equation (2.32) becomes

    $\binom{p}{k} = mp$    (2.33)

for some integer $m$. Note that for any $x$ in a domain of characteristic $p$, $mpx = 0$. Thus, every term on the right-hand side of Equation (2.31) vanishes except $\alpha^p$ and $\beta^p$. ♦

As an immediate corollary, the following relationship holds:

    $(\alpha_0 + \alpha_1 + \ldots + \alpha_s)^{p^n} = \alpha_0^{p^n} + \alpha_1^{p^n} + \ldots + \alpha_s^{p^n}$,

for $\alpha_i \in F$, $i = 0, 1, \ldots, s$, and $n \in N$.
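The key step of the proof, that every inner binomial coefficient $\binom{p}{k}$ is a multiple of $p$, is easy to test numerically. The sketch below is illustrative only; it also checks the additive property in the prime field GF(7), where residues stand in for field elements.

```python
from math import comb

def inner_binomials_vanish(p: int) -> bool:
    """Check that C(p, k) is divisible by p for 1 <= k <= p - 1."""
    return all(comb(p, k) % p == 0 for k in range(1, p))

primes_ok = all(inner_binomials_vanish(p) for p in (2, 3, 5, 7, 11, 13))
# The claim fails for a composite modulus: C(4, 2) = 6 is not a multiple of 4.
composite_ok = inner_binomials_vanish(4)

# Consequence in GF(p): (a + b)^p == a^p + b^p for all residues a, b.
p = 7
frobenius_additive = all(
    pow(a + b, p, p) == (pow(a, p, p) + pow(b, p, p)) % p
    for a in range(p) for b in range(p)
)
```

The composite case is included only to show that primality of $p$ is what makes the argument work.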

Now consider further details on the properties of finite fields such as minimal polynomials, conjugates, and automorphisms. According to Theorem 2.3, $(\alpha + \beta)^p = \alpha^p + \beta^p$. Since $0^p = 0$, $1^p = 1$, and $(\alpha\beta)^p = \alpha^p\beta^p$, it follows that the operation of taking $p$th powers preserves all the structure of the finite field GF($p^m$). Plainly, the $p^i$th power function will also define an automorphism of GF($p^m$) for each $i$. Fermat's theorem implies that every element $\beta$ of GF($p^m$) satisfies the equation $x^{p^m} - x = 0$. Although this polynomial has all its coefficients from the subfield GF($p$) and is monic, $\beta$ may satisfy a lower degree equation. A polynomial $\phi(x)$ of the smallest degree with coefficients in GF($p$) such that $\phi(\beta) = 0$ is called the minimal polynomial of $\beta$ over GF($p$) and is denoted by $\Phi_\beta(x)$. The following theorems illustrate properties of minimal polynomials.

Theorem 2.4: Let $f(x)$ be a polynomial of degree $l$ with coefficients in GF($p$), which is irreducible in this field, and let $\beta$ be a root of $f(x)$ in an extension field. Then $\beta, \beta^p, \beta^{p^2}, \ldots, \beta^{p^{l-1}}$ are all the roots of $f(x)$.

Proof. Let $f(x) = a_0 + a_1x + \ldots + a_lx^l$; then by Theorem 2.3, $(f(x))^p = a_0^p + a_1^px^p + \ldots + a_l^px^{lp} = a_0 + a_1(x^p) + \ldots + a_l(x^p)^l = f(x^p)$. If $f(\beta) = 0$, then $f(\beta^p) = (f(\beta))^p = 0$. By induction, one finds that $\beta^p, \ldots, \beta^{p^{l-1}}$ are roots of $f(x)$. The following argument shows that these $l$ field elements are distinct. Suppose the field elements are not distinct, with $\beta^{p^i} = \beta^{p^j}$, and suppose $j < i$. Then $\beta = \beta^{p^l} = (\beta^{p^i})^{p^{l-i}} = (\beta^{p^j})^{p^{l-i}} = \beta^{p^{l+j-i}}$, therefore $1 = \beta^{(p^{l+j-i} - 1)}$. Thus, the order of $\beta$ divides $p^{l+j-i} - 1$. Since $f(x)$ differs from the minimal polynomial of $\beta$ by at most a constant factor and $l + j - i < l$, this contradicts the definition of the minimal polynomial. Therefore $\beta, \beta^p, \ldots, \beta^{p^{l-1}}$ are distinct. Since $f(x)$ can have at most $l$ roots, these must be all the roots. ♦

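Theorem 2.5: Let $f(x)$ be an irreducible monic polynomial with coefficients from GF($p$) for which $f(\beta) = 0$, where $\beta$ is an element of some extension field of GF($p$). Then $f(x) = \Phi_\beta(x)$.

Proof. Let $f(x) = q(x)\Phi_\beta(x) + r(x)$. Since $\Phi_\beta(\beta) = 0$ and $f(\beta) = q(\beta)\Phi_\beta(\beta) + r(\beta) = 0$, then $r(\beta) = 0$. Because the degree of $r(x)$ is less than that of $\Phi_\beta(x)$, minimality forces $r(x) = 0$. This implies that $\Phi_\beta(x)$ divides $f(x)$; however, $f(x)$ is irreducible, so that $f(x) = \Phi_\beta(x)$. ♦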

Theorem 2.6: All the roots of an irreducible polynomial have the same order.

Proof. Given by Peterson [Pet72a]. ♦

Theorem 2.7: If $f(x)$ is irreducible with $\beta$ as a root, then $\deg(f(x))$ is the smallest $l$ such that $\beta^{p^l} = \beta$.

Proof. Given by Peterson [Pet72a]. ♦

Example 2.5: Calculation of the minimal polynomials of $\beta$ in GF($2^4$).

Consider the field GF($2^4$) formed by modulo $f(x)$ arithmetic with $f(x) = x^4 + x + 1$. Find the minimal polynomial $\Phi_\beta(x)$ of $\beta$ in GF($2^4$). From Theorem 2.4, one has $\Phi_\beta(x) = (x - \beta)(x - \beta^p) \cdots (x - \beta^{p^{l-1}})$. Thus, the minimal polynomials of $\beta$ in GF($2^4$) are listed as follows:

    $\Phi_{\alpha}(x) = (x - \alpha)(x - \alpha^2)(x - \alpha^4)(x - \alpha^8)$;

    $\Phi_{\alpha^3}(x) = (x - \alpha^3)(x - \alpha^6)(x - \alpha^{12})(x - \alpha^{24})$
                      $= (x - \alpha^3)(x - \alpha^6)(x - \alpha^{12})(x - \alpha^9)$;

    $\Phi_{\alpha^5}(x) = (x - \alpha^5)(x - \alpha^{10})$; and

    $\Phi_{\alpha^7}(x) = (x - \alpha^7)(x - \alpha^{14})(x - \alpha^{28})(x - \alpha^{56})$
                      $= (x - \alpha^7)(x - \alpha^{14})(x - \alpha^{13})(x - \alpha^{11})$.




To solve for the coefficients of the minimal polynomial, start by evaluating $\Phi_\alpha(x)$. Assume $\Phi_\alpha(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4$. Because $\Phi_\alpha(x)$ is monic and irreducible, $a_4 = 1$ and $a_0 = 1$. From $\Phi_\alpha(\alpha) = 0$, one finds that

    $1 + a_1\alpha + a_2\alpha^2 + a_3\alpha^3 + \alpha^4 = 0$,

    $1 + a_1\alpha + a_2\alpha^2 + a_3\alpha^3 + (\alpha + 1) = 0$, and

    $(a_1 + 1)\alpha + a_2\alpha^2 + a_3\alpha^3 = 0$.

These indicate that $a_1 = 1$, $a_2 = 0$, and $a_3 = 0$. Thus $\Phi_\alpha(x) = x^4 + x + 1$. In the case of $\Phi_{\alpha^3}(x)$, let $\Phi_{\alpha^3}(x) = a_0 + a_1x + a_2x^2 + a_3x^3 + a_4x^4$. Similarly $a_4 = 1$, $a_0 = 1$, and $1 + a_1\alpha^3 + a_2\alpha^6 + a_3\alpha^9 + \alpha^{12} = 0$. From Table 2.4, it is found that $\alpha^6 = \alpha^3 + \alpha^2$, $\alpha^9 = \alpha^3 + \alpha$, and $\alpha^{12} = \alpha^3 + \alpha^2 + \alpha + 1$, so that the preceding equation becomes

    $1 + a_1\alpha^3 + a_2(\alpha^3 + \alpha^2) + a_3(\alpha^3 + \alpha) + (\alpha^3 + \alpha^2 + \alpha + 1) = 0$,

    $(a_1 + a_2 + a_3 + 1)\alpha^3 + (a_2 + 1)\alpha^2 + (a_3 + 1)\alpha = 0$,

which implies $a_3 = 1$, $a_2 = 1$, $a_1 = 1$, and $\Phi_{\alpha^3}(x) = 1 + x + x^2 + x^3 + x^4$. ♦
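The minimal polynomials can also be obtained mechanically by expanding the product over the conjugates given by Theorem 2.4. The sketch below is a minimal illustration for GF($2^4$) with $f(x) = x^4 + x + 1$; elements are 4-bit masks and all function names are hypothetical.

```python
POLY, M = 0b10011, 4                      # f(x) = x^4 + x + 1 over GF(2)

def gf_mul(a, b):
    """Multiply two GF(16) elements stored as 4-bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result

def gf_pow(a, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result

def minimal_polynomial(beta):
    """Expand prod (x - c) over the conjugates c = beta, beta^2, beta^4, ...

    Returns coefficients, lowest degree first; each one ends up as 0 or 1.
    """
    conjugates, c = [], beta
    while c not in conjugates:
        conjugates.append(c)
        c = gf_mul(c, c)                  # squaring gives the next conjugate
    poly = [1]
    for c in conjugates:
        # Multiply poly by (x + c); minus equals plus in characteristic 2.
        shifted = [0] + poly
        scaled = [gf_mul(c, coef) for coef in poly] + [0]
        poly = [s ^ t for s, t in zip(shifted, scaled)]
    return poly

alpha = 0b0010
min_polys = {k: minimal_polynomial(gf_pow(alpha, k)) for k in (1, 3, 5, 7)}
```

Under these assumptions `min_polys[1]` and `min_polys[3]` reproduce the two polynomials derived by hand above.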


2.5.3  Bases of GF($p^m$) over GF($p$)

The finite field GF($p^m$) can be regarded as a vector space of dimension $m$ over GF($p$). Any set of $m$ linearly independent elements can be used as a basis for this vector space. Thus, the number of distinct bases is usually rather large [Lid86]. Accordingly, it is wise to restrict one's attention to certain special types of basis. Different vector basis representations of field elements result in different architectures of the field arithmetic system. Because their distinct features make them suitable for specific applications, this research effort investigates three basis types of particular importance. They are the primal basis (or standard basis), the normal basis, and the dual basis (or complementary basis). A system based on primal bases does not require basis conversion, hence it is readily matched to any input or output system. The normal basis system is effective in performing operations such as finding an inverse element, squaring (in GF($2^m$)), or exponentiation of a field element. A dual basis system can easily perform multiplication of a field element $x$ by a power of a primitive element $\alpha$. A detailed discussion of the architectural designs stemming from these bases follows later in the chapter.

It is important to introduce the concept of the trace of a field element, which is a very useful analytic tool in finite fields. Let $\alpha$ be an element of GF($p^m$); then its trace function Tr($\alpha$) over GF($p$) is defined as

    $\mathrm{Tr}(\alpha) = \alpha + \alpha^p + \alpha^{p^2} + \ldots + \alpha^{p^{m-1}}$.    (2.34)

The trace function satisfies the following properties (shown without proofs):

For all $\alpha, \beta \in$ GF($p^m$),

1) $\mathrm{Tr}(\alpha) \in$ GF($p$); the function is a linear transformation from GF($p^m$) onto GF($p$);

2) $\mathrm{Tr}(\alpha + \beta) = \mathrm{Tr}(\alpha) + \mathrm{Tr}(\beta)$;

3) $\mathrm{Tr}(\lambda\alpha) = \lambda \cdot \mathrm{Tr}(\alpha)$, if $\lambda \in$ GF($p$);

4) $\mathrm{Tr}(\alpha)^p = \mathrm{Tr}(\alpha^p) = \mathrm{Tr}(\alpha)$; and

5) $\mathrm{Tr}(\alpha) = m\alpha$, $\alpha \in$ GF($p$).
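The listed properties can be spot-checked in GF($2^4$). The sketch below assumes the field is generated by $f_1(x) = x^4 + x + 1$ and stores elements as 4-bit masks; `trace` is a hypothetical helper that simply evaluates Equation (2.34) with $p = 2$, $m = 4$.

```python
POLY, M = 0b10011, 4                      # f1(x) = x^4 + x + 1

def gf_mul(a, b):
    """Multiply two GF(16) elements stored as 4-bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result

def trace(a):
    """Tr(a) = a + a^2 + a^4 + a^8 in GF(16) over GF(2)."""
    total, term = 0, a
    for _ in range(M):
        total ^= term                     # addition in GF(2^m) is XOR
        term = gf_mul(term, term)
    return total

traces = [trace(a) for a in range(16)]
in_ground_field = all(t in (0, 1) for t in traces)
additive = all(trace(a ^ b) == trace(a) ^ trace(b)
               for a in range(16) for b in range(16))
frobenius_invariant = all(trace(gf_mul(a, a)) == trace(a) for a in range(16))
```

With $m = 4$ and $p = 2$, property 5 gives $\mathrm{Tr}(1) = 4 \cdot 1 = 0$, which agrees with `traces[1]`.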

Primal basis. The most popular basis is the primal basis

    $N_p = \{ 1, \alpha, \alpha^2, \ldots, \alpha^{m-1} \}$,    (2.35)

which consists of the powers of a defining element $\alpha$ of GF($p^m$) over GF($p$).

Dual basis. The second type of basis, the dual basis, is defined as follows. The bases

    $N_\alpha = \{ \alpha_0, \alpha_1, \ldots, \alpha_{m-1} \}$    (2.36)

and

    $N_\beta = \{ \beta_0, \beta_1, \ldots, \beta_{m-1} \}$    (2.37)

of GF($p^m$) over GF($p$) are dual bases if, for $0 \le i, j \le m - 1$, one has $\mathrm{Tr}(\alpha_i\beta_j) = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta function, equal to 1 if $i$ is equal to $j$ and zero otherwise.

Normal basis. The last type of basis is the normal basis

    $N_n = \{ \alpha, \alpha^p, \alpha^{p^2}, \ldots, \alpha^{p^{m-1}} \}$,    (2.38)

which is defined by a suitable element $\alpha$ of GF($p^m$).

Example 2.6: The bases in GF($2^4$) over GF(2).

Continuing Example 2.4, let $\alpha \in$ GF($2^4$) be a primitive root of the irreducible polynomial $f_1(x)$. Then $N_p = \{ 1, \alpha, \alpha^2, \alpha^3 \}$ is a primal basis of GF($2^4$) over GF(2). Its uniquely determined dual basis is easily checked to be $N_d = \{ 1 + \alpha^3, \alpha^2, \alpha, 1 \}$. The set $N = \{ \alpha, \alpha^2, \alpha^4, \alpha^8 \}$ is not a normal basis because $\alpha + \alpha^2 + \alpha^4 + \alpha^8 = 0$, so that $N$ is not linearly independent. However, a normal basis can be found by choosing another element, $\alpha^3$, and raising it to powers of 2, such that $N_n = \{ \alpha^3, \alpha^6, \alpha^{12}, \alpha^{24} \} = \{ \alpha^3, \alpha^6, \alpha^{12}, \alpha^9 \}$. ♦
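The claims of Example 2.6 can be verified by direct computation. The sketch below is an illustration under the same assumptions as before (GF($2^4$) generated by $x^4 + x + 1$, elements as 4-bit masks); `is_basis` is a hypothetical helper that tests linear independence by enumerating all GF(2) combinations.

```python
POLY, M = 0b10011, 4

def gf_mul(a, b):
    """Multiply two GF(16) elements stored as 4-bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result

def trace(a):
    total, term = 0, a
    for _ in range(M):
        total ^= term
        term = gf_mul(term, term)
    return total

def is_basis(vectors):
    """True if all 2^4 GF(2)-combinations of `vectors` are distinct."""
    sums = set()
    for mask in range(1 << len(vectors)):
        s = 0
        for i, v in enumerate(vectors):
            if (mask >> i) & 1:
                s ^= v
        sums.add(s)
    return len(sums) == 16

powers = [1]                               # powers[k] = alpha^k, alpha = x
for _ in range(14):
    powers.append(gf_mul(powers[-1], 0b0010))

primal = [powers[0], powers[1], powers[2], powers[3]]
dual = [powers[0] ^ powers[3], powers[2], powers[1], powers[0]]
gram = [[trace(gf_mul(a, b)) for b in dual] for a in primal]

conj_of_alpha = [powers[1], powers[2], powers[4], powers[8]]
conj_of_alpha3 = [powers[3], powers[6], powers[12], powers[9]]
```

Here `gram` is the matrix of traces $\mathrm{Tr}(\alpha_i\beta_j)$, which should be the identity for a dual pair.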


CHAPTER  3 

COMPUTATIONS  IN  FINITE  FIELDS 
3.1  Introduction 

The arithmetic over the finite field GF($q$) has to be distinguished between two cases: 1) the order of the field is a prime, $q = p$, in which case the arithmetic in GF($q$) is that of $Z_p$, the field of integers mod $p$; 2) the order of the field is a prime power, $q = p^m$, $m > 1$, in which case the arithmetic in GF($q$) becomes that of an $m$-dimensional algebra over GF($p$). In the latter case, the complexity of the system structure depends more heavily on the size of the field and on the data structure used to represent elements of the field.

In the previous process of constructing GF($p^m$) from GF($p$), two representations for the nonzero elements of GF($q$) were developed: the power form and the polynomial form. In this chapter, the structures and applications of these representations are investigated. Zech's logarithm [Con68] solves the problem of the nontrivial addition operation of the power form representation in finite fields. However, as the size of the field becomes larger, the conversion between power and polynomial forms proves to be a challenging task, which is traditionally termed the discrete logarithm and exponentiation problem and is treated separately as an interesting research topic in the area of cryptography. Hence, it is not proper to conduct arithmetic operations in the power form when the field has large order. In the polynomial structure, however, multiplication becomes a nontrivial task. Three different algorithms for multiplication are investigated based upon the basis representation of the polynomial form in finite fields. Finally, the concept of discrete Fourier transforms in finite fields is reviewed so that it can be applied to the development of the fast Galois field transforms in the following chapters.




3.2  Index  Calculus 

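Index of $\beta$ relative to $\alpha$ ($\mathrm{ind}_\alpha\beta$ or $\log_\alpha\beta$). Let $\alpha$ be a primitive root of $N$ and $\gcd(\beta, N) = 1$; then $k$ is the index of $\beta$ relative to $\alpha$ (written $k = \mathrm{ind}_\alpha\beta$) if it is the smallest positive integer such that $\beta \equiv \alpha^k \bmod N$. The utility of indices is due to the following logarithm-like relationships they obey:

    $\mathrm{ind}_\alpha(ab) = (\mathrm{ind}_\alpha a + \mathrm{ind}_\alpha b) \bmod \phi(N)$;
    $\mathrm{ind}_\alpha(b^l) = (l \cdot \mathrm{ind}_\alpha b) \bmod \phi(N)$;
    $\mathrm{ind}_\alpha 1 \equiv 0 \bmod \phi(N)$;
    $\mathrm{ind}_\alpha \alpha \equiv 1 \bmod \phi(N)$.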
For example, Table 3.1 shows the indices of numbers relatively prime to $N = 9$.

Table 3.1  Illustration of Indices for $N = 9$

    $k = \mathrm{ind}_2 a$     1   2   3   4   5   6
    $a = 2^k \bmod 9$          2   4   8   7   5   1

According to the table and the fact that $\phi(9) = 9(1 - 1/3) = 6$, the following cases are obtained:

1) $\mathrm{ind}_2(8 \cdot 7) = \mathrm{ind}_2 8 + \mathrm{ind}_2 7 = 3 + 4 \equiv 1 \bmod 6$, and $2^{\mathrm{ind}_2(8 \cdot 7)} = 2^1 \equiv 8 \cdot 7 \bmod 9$;

2) $\mathrm{ind}_2 8 = \mathrm{ind}_2 2^3 = 3\,(\mathrm{ind}_2 2) = 3$;

3) $\mathrm{ind}_2 1 = 6 \equiv 0 \bmod 6$; and

4) $\mathrm{ind}_2 2 = 1$. ♦
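Table 3.1 and case 1 can be regenerated with a few lines. This is only a sketch; `index_table` is a hypothetical helper that builds the index of every unit modulo $N$ for an assumed primitive root.

```python
from math import gcd

def index_table(alpha, N):
    """Map each unit b mod N to the smallest k >= 1 with alpha^k = b (mod N)."""
    phi = sum(1 for b in range(1, N) if gcd(b, N) == 1)   # Euler's totient
    table = {pow(alpha, k, N): k for k in range(1, phi + 1)}
    if len(table) != phi:
        raise ValueError("alpha is not a primitive root of N")
    return table, phi

ind, phi = index_table(2, 9)
# Index of the product 8 * 7 obtained by adding indices modulo phi(9).
product_index = (ind[8] + ind[7]) % phi
```

Raising 2 to `product_index` modulo 9 then gives the same residue as $8 \cdot 7 \bmod 9$.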


3.3  The  Table  Lookup  Method  For  Small  Fields 

In the case of small finite fields, it is more efficient to use the power form to represent field elements while applying the table lookup method for the arithmetic operations. Fast memory devices are good candidates for the implementation of tables as long as the table capacity does not exceed what current device technology offers. The following algorithms are used to perform the arithmetic operations in finite fields.

All-Table (AT) approach. Tables can be implemented using fast memory devices, such as programmable read-only memory (PROM) and static random-access memory (SRAM), with $r_A + r_B$ bits of address line and $r_{A \circ B}$ bits of data output line. The notations $r_A$ and $r_B$ represent the number of bits of the two input operands $A$ and $B$ respectively, and $r_{A \circ B}$ represents the number of bits of the output under an operation of $A$ and $B$. For $i = A$, $B$, or $A \circ B$, $r_i = \log_2 Q_i$, where $Q_i$ is the range of data in the finite field GF($q$).

In this approach, the memory capacity requirement, regarded as the number of gates of the memory devices, is calculated approximately as

    $T = 2^{(r_A + r_B)} \cdot r_{A \circ B}$
      $= Q_A \cdot Q_B \cdot \log_2 Q_{A \circ B} = O(q^2 \log_2 q)$.

An implementation of a finite field arithmetic unit is depicted in Figure 3.1, where the precomputed contents, determined by the type of arithmetic operation, are loaded into the lookup table memory initially.


[Figure 3.1: block diagram. Input A ($r_A$ bits) and Input B ($r_B$ bits) address the Lookup Table, whose $r_{A \circ B}$-bit Output is $A \circ B$.]

Figure 3.1  A Finite Field Arithmetic Unit Based On the All-Table Approach


Mixed Table-Logic (MTL) approach. The $q^2$ factor in the capacity requirement of the AT approach prevents the table lookup method from being applied in full. Let $\alpha$ be a primitive element and $\beta$ be a nonzero element of GF($q$). The index of $\beta$ with respect to the base $\alpha$ is the uniquely determined integer $r$, where $0 \le r < q - 1$ and $\beta = \alpha^r$, from the construction of finite fields discussed in the previous section. The index $r$ is called the discrete logarithm of $\beta$ and is denoted by $r = \log_\alpha\beta$ (or $\mathrm{ind}_\alpha\beta$). Its inverse function, denoted as $\exp_\alpha(r) = \alpha^r = \beta$, is called the discrete exponentiation. Applying the isomorphism

    $\log_\alpha : \mathrm{GF}(q)^* \rightarrow Z_{q-1}$,    (3.1)

for a primitive element $\alpha \in$ GF($q$), multiplication can be transformed into addition modulo $q - 1$, provided that logarithm and exponentiation tables are available.


A multiplier based on the discrete logarithm/exponentiation approach is shown in Figure 3.2. Note that the number of gates required for a suitable memory is given by $O(q \log_2 q)$.


Figure 3.2  A Discrete Logarithm/Exponentiation Multiplier


Compared with the AT approach, a $q$-fold saving in terms of memory size is obtained, which becomes more significant when the order $q$ of a field becomes large.
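In software, the logarithm/exponentiation multiplier amounts to two table lookups and one addition modulo $q - 1$. The sketch below assumes GF($2^4$) generated by $x^4 + x + 1$ with $\alpha = x$; the table names `EXP` and `LOG` are illustrative, and each table has only $q - 1$ entries instead of the $q^2$ entries of an all-table multiplier.

```python
POLY, M = 0b10011, 4
Q = 1 << M

# Exponentiation table: EXP[r] = alpha^r as a 4-bit mask, alpha = x.
EXP = [1]
for _ in range(Q - 2):
    nxt = EXP[-1] << 1
    if nxt >> M:
        nxt ^= POLY
    EXP.append(nxt)

# Discrete logarithm table, the inverse of EXP on the nonzero elements.
LOG = {value: r for r, value in enumerate(EXP)}

def multiply(a, b):
    """Multiply two field elements through the log/exp tables."""
    if a == 0 or b == 0:
        return 0                      # zero has no logarithm; handle it apart
    return EXP[(LOG[a] + LOG[b]) % (Q - 1)]
```

For example, `multiply(0b1000, 0b0010)` evaluates $\alpha^3 \cdot \alpha = \alpha^4$.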

When the elements of a field are represented in logarithmic format, addition becomes a nontrivial task. Applying the group property of GF($q$)*, Zech's logarithm [Con68] suggests an efficient way to perform exponential addition by introducing a second level of table lookup. Let $M_q = \{ 0, 1, \ldots, q - 2 \} \cup \{ -\infty \}$ in GF($q$). Zech's logarithm, denoted as $Z(n)$, is defined as

    $Z(n) = \log_\alpha(1 + \alpha^n)$    (3.2)

for all elements $n \in M_q$. It is obvious that $Z$ is a mapping $Z : M_q \rightarrow M_q$. Therefore, addition is performed as follows. Let $z$ be the exponent of the sum $\alpha^x + \alpha^y$, that is, $\alpha^z = \alpha^x + \alpha^y$; then

    $\alpha^z = \alpha^x (1 + \alpha^{y-x})$
            $= \alpha^y (1 + \alpha^{x-y})$.    (3.3)

Applying Zech's logarithm defined in Equation (3.2), the exponent of the sum becomes

    $z = x + Z(y - x) = y + Z(x - y)$.    (3.4)

Example 3.1: The Zech's logarithms in GF($2^4$) and their addition operation.

Zech's logarithms in the finite field GF($2^4$), which is generated by the polynomial $f_1(x)$ as in Example 2.4, are listed in Table 3.2.

Table 3.2  The Zech's Logarithms in GF($2^4$)

    $n$       $-\infty$   0           1   2   3    4   5    6    7   8   9   10   11   12   13   14
    $Z(n)$    0           $-\infty$   4   8   14   1   10   13   9   2   7   5    12   11   6    3

Accordingly, the addition of the elements $\alpha^3$ and $\alpha^7$ is given as

    $\alpha^3 + \alpha^7 = \alpha^3 (1 + \alpha^4) = \alpha^3 \alpha^{Z(4)} = \alpha^4$. ♦

Having the nontrivial addition problem solved, the arithmetic operations in a small finite field GF($q$) can be carried out rather easily. Let the inputs be $A = \alpha^x$ and $B = \alpha^y$, and let the isomorphism in Equation (3.1) hold; then the arithmetic operations of the finite field GF($q$) are summarized as follows.

1) Addition:

    $A + B \mapsto [x + Z(y - x)] \bmod (q - 1)$,    (3.5)

or

    $A + B \mapsto [y + Z(x - y)] \bmod (q - 1)$;    (3.6)

2) Additive Inverse: Since $p - 1 = -1$ in GF($q$) of characteristic $p$, the discrete logarithm function is used to find $k$ such that $\alpha^k = p - 1$. Then, the inverse of $B$ becomes

    $-B = -\alpha^y = \alpha^k\alpha^y = \alpha^{k+y}$.    (3.7)

Therefore, the isomorphic mapping is

    $-B \mapsto [k + y] \bmod (q - 1)$;    (3.8)

3) Subtraction: According to the fact that $\alpha^k = p - 1$,

    $A - B \mapsto [x + Z(k + y - x)] \bmod (q - 1)$;    (3.9)

4) Multiplication:

    $A \cdot B \mapsto [x + y] \bmod (q - 1)$;    (3.10)

5) Multiplicative Inverse: Since $B^{-1} = \alpha^{-y} = \alpha^{q-1-y}$, thus

    $B^{-1} \mapsto q - y - 1$;    (3.11)

6) Division:

    $A / B \mapsto [x + (q - 1 - y)] \bmod (q - 1)$;    (3.12)

7) Power:

    $A^n \mapsto nx \bmod (q - 1)$.    (3.13)
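The operations summarized above can be sketched in the exponent domain. The example below assumes GF($2^4$) generated by $x^4 + x + 1$; `None` stands in for the formal exponent $-\infty$, and all names are illustrative. In characteristic 2, $k = 0$, so subtraction coincides with addition and is omitted.

```python
POLY, M = 0b10011, 4
Q = 1 << M
NEG_INF = None                        # stands for the formal exponent -infinity

EXP = [1]                             # EXP[r] = alpha^r as a 4-bit mask
for _ in range(Q - 2):
    nxt = EXP[-1] << 1
    EXP.append(nxt ^ POLY if nxt >> M else nxt)
LOG = {v: r for r, v in enumerate(EXP)}
LOG[0] = NEG_INF

# Zech table from the definition alpha^Z(n) = 1 + alpha^n.
Z = {n: LOG[EXP[n] ^ 1] for n in range(Q - 1)}

def add(x, y):
    """Exponent of alpha^x + alpha^y, following Equation (3.5)."""
    if x is NEG_INF:
        return y
    if y is NEG_INF:
        return x
    z = Z[(y - x) % (Q - 1)]
    return NEG_INF if z is NEG_INF else (x + z) % (Q - 1)

def mul(x, y):
    """Exponent of alpha^x * alpha^y, following Equation (3.10)."""
    if x is NEG_INF or y is NEG_INF:
        return NEG_INF
    return (x + y) % (Q - 1)

def inv(y):
    """Exponent of the multiplicative inverse, following Equation (3.11)."""
    return (Q - 1 - y) % (Q - 1)

def div(x, y):
    """Exponent of alpha^x / alpha^y, following Equation (3.12)."""
    return mul(x, inv(y))
```

For instance, `add(3, 7)` reproduces the sum $\alpha^3 + \alpha^7 = \alpha^4$ of Example 3.1.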

Logarithmic MTL arithmetic unit. A finite field arithmetic unit based on the power form representation of field elements is developed and shown in Figure 3.3. The logarithm tables in the front end serve as the isomorphic mapping defined in Equation (3.1), whereas the exponentiation table in the back end serves as the isomorphic inverse mapping. In the discrete logarithm domain of the arithmetic unit, operations such as addition, subtraction, multiplication, and division are carried out based on the algorithms developed in the preceding section. The type of arithmetic operation is chosen by controlling the 'mode selection' signal at the multiplexer of the output stage. Since multiplication bypasses the Zech's logarithm table lookup required by the additive operations, it is the fastest of all the operations.

[Figure 3.3: block diagram with Input A and Input B at the top and Output at the bottom.]

Figure 3.3  A Logarithmic Finite Field Arithmetic Unit




Computation of Zech's logarithm table. Since $\alpha$ is a primitive element in GF($q$), $x$ and $\alpha^x$ are in one-to-one correspondence for $x = -\infty, 0, 1, \ldots, q - 2$. The same relationship accordingly exists between $x$ and $Z(x)$; hence Zech's logarithm is a bijective mapping. Knowing $\alpha^x$, Zech's logarithm can be obtained by the brute-force approach according to its definition in Equation (3.2).

Provided that the logarithm or exponentiation tables, such as those shown in Table 3.1 or Table 3.2, are available, the following algorithm is used to compute the Zech's logarithm table.

Algorithm 3.1

1) The vector form $[x_0, x_1, \ldots, x_{n-1}]$ of the field element $\alpha^x$ is obtained from the precomputed exponentiation table;

2) Using modulo $p$ arithmetic, where $p$ is the characteristic of GF($q$), add 1 to the least significant tuple of the vector form of $\alpha^x$. The least significant tuple thus becomes $x_0' = x_0 + 1 \bmod p$;

3) Zech's logarithm $Z(x)$ is then obtained by applying the resultant vector $[x_0', x_1, \ldots, x_{n-1}]$ to the logarithm table. ♦
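Algorithm 3.1 translates almost line for line into code. The sketch below is an illustration under stated assumptions: field elements are tuples $(x_0, \ldots, x_{n-1})$, the defining polynomial is monic and primitive, and `exp_table` and `zech_table` are hypothetical names. The second call uses $x^2 + x + 2$ over GF(3), chosen here only to exercise the modulo $p$ step with $p \ne 2$; the entry $Z(-\infty) = 0$ is not tabulated.

```python
def exp_table(f_low, p):
    """Vector forms (x_0, ..., x_{n-1}) of alpha^0, alpha^1, ... for alpha = x.

    f_low holds c_0..c_{n-1} of the monic polynomial
    x^n + c_{n-1} x^{n-1} + ... + c_0, assumed primitive over GF(p).
    """
    n = len(f_low)
    vec = (1,) + (0,) * (n - 1)
    table = []
    for _ in range(p ** n - 1):
        table.append(vec)
        top = vec[-1]                     # coefficient that overflows to x^n
        shifted = (0,) + vec[:-1]         # multiply by x
        vec = tuple((s - top * c) % p for s, c in zip(shifted, f_low))
    return table

def zech_table(f_low, p):
    """Algorithm 3.1: look up alpha^x, add 1 to x_0 mod p, look up the log."""
    exp = exp_table(f_low, p)                       # step 1
    log = {v: r for r, v in enumerate(exp)}
    zech = {}
    for x, vec in enumerate(exp):
        bumped = ((vec[0] + 1) % p,) + vec[1:]      # step 2
        zech[x] = log.get(bumped)                   # step 3; None means -infinity
    return zech

Z16 = zech_table((1, 1, 0, 0), 2)         # f1(x) = x^4 + x + 1
Z9 = zech_table((2, 1), 3)                # x^2 + x + 2 over GF(3)
```

Under these assumptions `Z16` agrees with the finite entries of Table 3.2.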

Although these logarithms are useful for computation purposes, the construction of the table of $Z(x)$ is rather cumbersome. However, by applying the concept of the cyclotomic cosets along with the properties of $Z(x)$ introduced by Imamura [Ima80] and Huber [Hub90], a time-saving method can be developed.

Some properties of $Z(x)$ are first reviewed. If $\alpha^x \ne 0$, that is, $x \ne -\infty$, then $\alpha^{Z(q-1-x)} = 1 + \alpha^{q-1-x}$. Since $\alpha^{q-1} = 1$, it turns out that $\alpha^{Z(q-1-x)} = 1 + \alpha^{-x} = (\alpha^x + 1)\alpha^{-x} = \alpha^{Z(x)-x}$. Therefore,

    $Z(q - 1 - x) = [Z(x) - x] \bmod (q - 1)$.    (3.14)




From Theorem 2.3, the relationship $(1 + \alpha^x)^p = 1 + \alpha^{px}$ holds in GF($q$), which leads to the result that $\alpha^{Z(px)} = 1 + \alpha^{px} = (1 + \alpha^x)^p = \alpha^{pZ(x)}$. Therefore,

    $Z(px) = pZ(x) \bmod (q - 1)$.    (3.15)

The inverse function $Z^{-1}(x)$ of Zech's logarithm is derived by defining the following iterated function $Z^i : M_q \rightarrow M_q$ and

    $Z^i(x) = Z(Z^{i-1}(x))$    (3.16)

where $i = 2, 3, \ldots$ and $Z^1(x) = Z(x)$. Obviously, $\alpha^{Z^i(x)} = i + \alpha^x$ and $\alpha^{Z^p(x)} = \alpha^x$. Thus, $Z(Z^{-1}(x)) = x = Z^p(x) = Z(Z^{p-1}(x))$. Accordingly, the inverse function $Z^{-1}(x)$ becomes

    $Z^{-1}(x) = Z^{p-1}(x)$.    (3.17)

When $p = 2$, $Z^{-1}(x) = Z(x)$. However, for $p \ne 2$, $Z^{-1}(x)$ is derived as follows. By choosing $r$ such that $\alpha^r = 2i$ and from Equation (3.16), $\alpha^{Z^i(x)} = i + \alpha^x = 2i + p - i + \alpha^x = \alpha^r (1 + \alpha^{Z^{p-i}(x) - r}) = \alpha^r \cdot \alpha^{Z(Z^{p-i}(x) - r)}$, so that

    $Z^i(x) = r + Z(Z^{p-i}(x) - r)$.    (3.18)

For the case $i = p - 1$, $\alpha^r = p - 2$, and Equation (3.18) becomes

    $Z^{p-1}(x) = Z^{-1}(x) = r + Z(Z(x) - r)$.    (3.19)

The cyclotomic coset concept is linked to the calculation of $Z(x)$ by the property stated in Equation (3.15). For the coset $C_s = \{ s, sp, sp^2, \ldots, sp^{l_s - 1} \}$ and the known logarithm $Z(s)$, the corresponding mapped coset is

    $C_{Z(s)} = \{ Z(s), pZ(s), p^2Z(s), \ldots, p^{l_s - 1}Z(s) \}$.

Note that Zech's logarithm maps cosets onto cosets of the same length. Given the known logarithm $Z(s)$, all other cosets of the same length are obtained by repeatedly applying either Equation (3.14) or the inverse function in Equation (3.17) or (3.19). Thus, by simply computing the key element $Z(s)$ for the set of cosets of the same length, the Zech's logarithm table can be constructed easily.




Example 3.2 Computation of the table of $Z(x)$.

Consider the field GF($p^m$) = GF($2^4$), constructed with the primitive polynomial
$f_1(x) = x^4 + x + 1$. From $f_1(x)$ the key element $Z(1) = 4$ is obtained immediately since
$x^{Z(1)} = 1 + x^1 = x^4 \bmod f_1(x)$. Using Equations (3.15) and (3.17) gives the Zech's logarithm of all
elements in $C_1$, where $C_1 = \{ 1, 2, 4, 8 \}$: $Z(1) = 4$, $Z(4) = 1$, $Z(2) = 8$, $Z(8) = 2$. Having $Z(1) = 4$
and applying Equation (3.14), we find $Z(14) = 3$ and also $Z(3) = 14$. Hence the elements in
$C_3 = \{ 3, 6, 12, 9 \}$ correspond one-to-one to the set $\{ 14, 13, 11, 7 \}$. Finally, for the only coset of
length 2, $C_5 = \{ 5, 10 \}$, we immediately know $Z(5) = 10$, $Z(10) = 5$. The result agrees with
Table 3.2.

As another example, for the field GF($2^5$) with defining polynomial $f_2(x) = x^5 + x^2 + 1$, we
get the key element $Z(2) = 5$. The following list represents the procedure to find the Zech's
logarithm table.

1) $C_1 = \{ 1, 2, 4, 8, 16 \}$;

2) Apply (3.15) to $Z(2) = 5$; $C_5 = \{ 18, 5, 10, 20, 9 \}$ is obtained;

3) Apply (3.14) to $Z(18) = 1$; $Z(13) = -17 = 14$ is obtained;

4) Apply (3.15) to $Z(13) = 14$; $C_{11} = \{ 13, 26, 21, 11, 22 \}$ is obtained;

5) $C_7 = \{ 14, 28, 25, 19, 7 \}$;

6) Apply (3.14) to $Z(14) = 13$; $Z(17) = -1 = 30$ is obtained;

7) $C_3 = \{ 17, 3, 6, 12, 24 \}$;

8) Apply (3.15) to $Z(17) = 30$; $C_{15} = \{ 30, 29, 27, 23, 15 \}$ is obtained.  ♦

The algorithm based upon the concept of cyclotomic cosets improves the calculation
of Zech's logarithms. The brute force method takes $p^m$ calculations for the finite field
GF($p^m$), while the new algorithm takes $k$ operations, where $k$ is the number of
different coset lengths.
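The procedure of Example 3.2 can be sketched in a few lines of Python. This is only an illustrative sketch for $p = 2$, not the implementation used in this work: the function names `zech_table_brute_force` and `zech_table_from_cosets` and the bit-mask encoding of the defining polynomial are assumptions made for the illustration. The first function tabulates $Z(x)$ exhaustively; the second starts from a single key value and spreads it with Equations (3.14), (3.15), and (3.17).

```python
def zech_table_brute_force(m, poly):
    """Zech logarithms of GF(2^m) by exhaustive search.

    poly is the primitive polynomial as a bit mask (x^5 + x^2 + 1 -> 0b100101).
    Returns Z with alpha**Z[x] == 1 + alpha**x; Z[0] is None because 1 + 1 = 0.
    """
    q = 1 << m
    power, elem = [], 1               # power[i] = alpha**i as a bit vector
    for _ in range(q - 1):
        power.append(elem)
        elem <<= 1                    # multiply by alpha
        if elem & q:                  # reduce modulo the defining polynomial
            elem ^= poly
    log = {e: i for i, e in enumerate(power)}
    return {x: log.get(power[x] ^ 1) for x in range(q - 1)}


def zech_table_from_cosets(m, key_x, key_z):
    """Spread one key value Z(key_x) = key_z over every reachable coset (p = 2)."""
    n = (1 << m) - 1                  # q - 1
    table, pending = {}, [(key_x, key_z)]
    while pending:
        x, z = pending.pop()
        if x in table:
            continue
        cx, cz = x, z
        while cx not in table:        # (3.15): Z(2x) = 2 Z(x) fills the coset of x
            table[cx] = cz
            cx, cz = (2 * cx) % n, (2 * cz) % n
        pending.append((z, x))                        # (3.17), p = 2: Z(Z(x)) = x
        pending.append(((n - x) % n, (z - x) % n))    # (3.14): Z(q-1-x) = Z(x) - x
    return table


reference = zech_table_brute_force(5, 0b100101)
from_key = zech_table_from_cosets(5, 2, 5)            # key element Z(2) = 5
print(all(from_key[x] == reference[x] for x in range(1, 31)))
```

For GF($2^5$) the single key $Z(2) = 5$ reaches all thirty entries. In GF($2^4$) the coset $C_5$ is mapped onto itself and is not reachable from $Z(1)$, which is why a separate key is needed for each coset length.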


3.4  Discrete  Logarithm  and  Exponentiation  Problems 


If a finite field is small enough, one can tabulate all the field elements and their
logarithms, and use this table for computation within the field, much as one uses a table of natural
logarithms for calculations involving real numbers. However, for large fields, for instance
GF($2^{127}$), it is infeasible to tabulate the logarithms. Thus the construction of discrete
logarithmic and exponential tables presents a great challenge. Furthermore, from previous
discussions, the table lookup method requires a memory capacity on the order of $p^m \log_2 p^m$.
With the current state of the art in semiconductor device technology, high speed megabit
memories can only provide capacity for fields of order $p^m = 10^6$.

Aside from the intrinsic interest of the problem of computing discrete logarithms,
it is of considerable importance in cryptography [Adl79, Blak84]. The use for cryptography
depends on the apparent one-way nature of the discrete logarithm function: it is relatively
difficult to extract logarithms in a large field, while it is relatively easy to exponentiate.
Several proposed algorithms for computing discrete logarithms are known. For the example of
the field GF($2^{127}$) with the primitive polynomial $f(x) = x^{127} + x + 1$, Adleman's
algorithm [Adl79] seems to take two weeks; a modification due to Blake [Blak84] takes
about nine hours.

The discrete exponential function $\exp_\alpha(r) = \alpha^r$ in $F_q$ (or GF($q$)) can be calculated for
$1 \le r \le q-1$ by the algorithm which is often called the square-and-multiply technique. In
detail, the elements $\alpha, \alpha^2, \alpha^4, \ldots, \alpha^{2^e}$ are first computed by repeated squaring, where $2^e$ is
the largest power of 2 that does not exceed $r$. Then $\alpha^r$ is obtained by multiplying together an
appropriate combination of these elements. For instance, to get $\alpha^{27}$ one would multiply
together the elements $\alpha$, $\alpha^2$, $\alpha^8$ and $\alpha^{16}$. A simple analysis shows that the calculation of $\alpha^r$
requires at most $2\lfloor \log_2 q \rfloor$ multiplications in $F_q$. Until recently, the inverse problem of
computing discrete logarithms in $F_q$ was believed to be much harder, since for one of the
best algorithms available then the required number of arithmetic operations in $F_q$ was of the
order of magnitude $q^{1/2}$. If the order $q$ is sufficiently large, for instance $q > 2^{100}$, exponentiation
in $F_q$ might justly have been regarded as a one-way function. However, great progress
has recently been achieved [Blak84, Lid86] in the computation of discrete logarithms, which
makes it necessary to construct cryptosystems based on discrete exponentiation in a careful
manner in order to protect them against attacks by these recent algorithms.
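The square-and-multiply technique can be illustrated with a short Python sketch. The field GF($2^6$) with $f(x) = x^6 + x + 1$, the bit-vector encoding of elements, and the names `gf_mul` and `gf_pow` are assumptions chosen for the illustration. The routine also counts the field multiplications it performs (squarings included) so that the count can be compared with the bound $2\lfloor \log_2 q \rfloor$.

```python
def gf_mul(a, b, poly, m):
    """Multiply two GF(2^m) elements held as bit vectors, reducing by poly."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1                      # a <- a * x
        if a >> m:                   # degree reached m: subtract the polynomial
            a ^= poly
    return result


def gf_pow(a, r, poly, m):
    """Square-and-multiply for r >= 1; returns (a**r, multiplications used)."""
    result, square, mults = None, a, 0
    while r:
        if r & 1:                    # this power of two appears in r
            if result is None:
                result = square
            else:
                result = gf_mul(result, square, poly, m)
                mults += 1
        r >>= 1
        if r:                        # next element of a, a^2, a^4, ...
            square = gf_mul(square, square, poly, m)
            mults += 1
    return result, mults


POLY, M = 0b1000011, 6               # x^6 + x + 1
value, used = gf_pow(0b10, 27, POLY, M)   # alpha = x
print(bin(value), used)
```

For $r = 27$ the sketch performs four squarings and three further multiplications, which is within the bound of 12 for $q = 64$.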

A discrete logarithm algorithm for $F_q$ is now discussed, which is the index-calculus
algorithm [Blak84]. Suppose $\alpha$ is a primitive element of $F_q$, where $q = p^m$. The algorithm
to find the discrete logarithm $\log_\alpha \beta$ of an arbitrary nonzero element $\beta$ of $F_q$ consists of
two stages. The first stage, which generates precomputed results, serves as a data base for the
second stage of the algorithm.

Algorithm 3.2 The Computation of the Discrete Logarithm in $F_q$.

1) Precomputation Stage:

1.1) Compute the discrete logarithms of all elements of a chosen subset $V$ of $F_q$. The set $V$
usually consists of all the monic irreducible polynomials over $F_p$ of degree $\le s$, where $s < m$.

1.2) $\log_\alpha d$ is known for all $d \in F_p^*$. For $p = 2$, $\log_\alpha(1) = 0$. For $p > 2$ we use the observation
that $\alpha' = \alpha^{(q-1)/(p-1)}$ is a primitive element of $F_p$ (because $\alpha'^{\,(p-1)} = \alpha^{q-1} = 1$), and
$d = (\alpha')^{\log_{\alpha'} d} = \alpha^{[(q-1)/(p-1)] \log_{\alpha'} d}$, such that

$\log_\alpha d = [(q-1)/(p-1)] \log_{\alpha'} d$, for all $d \in F_p^*$.

The value $\log_{\alpha'} d$ may be obtained by direct calculation.

1.3) Choose a random integer, $t$, with $1 \le t \le q-2$, and form

$\alpha_1 = \alpha^t \bmod f(x)$, $\deg(\alpha_1) < m$.

Then factor $\alpha_1$ into irreducible polynomials over $F_p$. If all the monic irreducible factors are
elements of $V$, so that

$\alpha_1 = d \prod_{v \in V} v^{e_v(\alpha_1)}$    (3.20)

with $d \in F_p^*$ is the canonical factorization in $F_p[x]$, then

$t = [\, \log_\alpha(d) + \sum_{v \in V} e_v(\alpha_1) \log_\alpha v \,] \bmod (q-1)$.    (3.21)

2) Main Stage:

2.1) Similar to 1.3), form the polynomial $\beta_1 \in F_p[x]$, $\beta_1 = \beta \cdot \alpha^t \bmod f(x)$, $\deg(\beta_1) < m$,
and thus

$\log_\alpha \beta = [\, \log_\alpha(d) + \sum_{v \in V} e_v(\beta_1) \log_\alpha v - t \,] \bmod (q-1)$.

2.2) If $\beta_1$ does not have the desired type of factorization, choose other values of $t$ until this
type of factorization is obtained.  ♦

Example 3.3 Compute the discrete logarithm of $\beta = x^4 + x^3 + x^2 + x + 1$ in $F_{64}$.

We consider $F_{64}$ defined by $f(x) = x^6 + x + 1 \in F_2[x]$. Then take $\alpha = x$ as a primitive
element of $F_{64}$, since $f(x)$ is a primitive polynomial over $F_2$. First we compute the data base
of $V$.

1) Suppose the maximum degree $s$ of irreducible polynomials in the set $V$ is 2, such that
$V = \{ x, x+1, x^2 + x + 1 \}$. Clearly $\log_\alpha(x) = 1$. We choose an integer $t$ with $1 \le t \le 62$. A good
choice is $t = 6$, since $\alpha_1 = \alpha^t = x^6 \equiv (x + 1) \bmod f(x)$, hence $\log_\alpha(x + 1) = 6$. Another good
choice is $t = 32$, since $x^{64} \equiv x \equiv x^6 + 1 = (x^3 + 1)^2 \bmod f(x)$ implies

$\alpha_1 = x^{32} \equiv (x^3 + 1) = (x + 1)(x^2 + x + 1) \bmod f(x)$.

This yields

$32 \equiv \log_\alpha(x + 1) + \log_\alpha(x^2 + x + 1) \equiv 6 + \log_\alpha(x^2 + x + 1) \bmod 63$,

hence $\log_\alpha(x^2 + x + 1) = 26$.

2) The choice of $t = 2$ yields

$\beta_1 \equiv (x^4 + x^3 + x^2 + x + 1)x^2 \equiv (x^6 + x^5 + x^4 + x^3 + x^2)$
$\equiv x^5 + x^4 + x^3 + x^2 + x + 1 \equiv (x^2 + x + 1)^2 (x + 1) \bmod f(x)$,

hence all the irreducible factors are in $V$. Therefore

$\log_\alpha(\beta) \equiv 2 \log_\alpha(x^2 + x + 1) + \log_\alpha(x + 1) - 2$
$\equiv 2 \cdot 26 + 6 - 2 \equiv 56 \bmod 63$,

and so $\log_\alpha(x^4 + x^3 + x^2 + x + 1) = 56$.  ♦
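The numbers in Example 3.3 are small enough to be checked by exhaustive tabulation. The Python sketch below is only a verification aid under assumed conventions (elements of $F_{64}$ stored as bit vectors, hypothetical names `times_x` and `log_of`); it is not the index-calculus algorithm itself. It builds the full logarithm table and then repeats the two index-calculus relations used in the example.

```python
M, POLY = 6, 0b1000011               # F_64 defined by f(x) = x^6 + x + 1


def times_x(a):
    """Multiply a field element (bit vector) by alpha = x."""
    a <<= 1
    return a ^ POLY if a >> M else a


# Exhaustive table: log_of[element] = t with x^t = element, 0 <= t <= 62.
log_of, elem = {}, 1
for t in range(63):
    log_of[elem] = t
    elem = times_x(elem)

# Precomputation stage: logarithms of the factor base V = {x, x+1, x^2+x+1}.
log_v1 = log_of[0b11]                # t = 6 gives x^6 = x + 1
log_v2 = (32 - log_v1) % 63          # t = 32 gives x^32 = (x+1)(x^2+x+1)

# Main stage with t = 2: beta * x^2 = (x^2+x+1)^2 (x+1).
log_beta = (2 * log_v2 + log_v1 - 2) % 63

print(log_v1, log_v2, log_beta)
print(log_of[0b111], log_of[0b11111])    # direct lookups for comparison
```

The relation values agree with the direct lookups only because $x$ is primitive here, so every nonzero element appears exactly once in the table.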

Up to this point, the computations in finite fields, especially multiplication, are easily
performed by using the power form representation. However, the problems of discrete logarithm
and exponentiation offset the advantages when the order of a field becomes large. Therefore,
the computations of finite fields in vector form are considered. Under such formats,
operations like addition and subtraction are trivial since they are computed in a component-wise
fashion over the ground field GF($p$), whereas multiplication and division are more complicated
and time-consuming. Based upon the representation of field elements in vector form, there
are various methods of designing multipliers in finite fields. In the following sections, the
architectures based on primal, dual, and normal bases are investigated. These designs can be applied
to and integrated into the finite field computation systems which will be developed in later chapters.

In the finite field GF($p^m$), the most common type of circuit uses a linear feedback shift
register to form the desired product sequentially in $m$ clock cycles. This circuit is simple and
relatively economical, but it operates rather slowly because of the $m$ flip-flop delays
incurred. Possible solutions to the slowness due to the sequential operation are also given.
A performance analysis of each design architecture is carried out to show the design
trade-offs of these architectures. The representation of the components in the circuit diagrams of the
finite field multipliers is shown in Figure 3.4.

Figure 3.4 The Building Blocks of the Finite Field Multipliers. a) Storage element such
as register and accumulator. b) Multiple input modulo $p$ adder. c) Modulo $p$ multiplier
for multiplying a datum by $g_i$. d) Modulo $p$ adder.


3.5  Primal  Basis  Multiplier 

Let $a \in$ GF($p^m$) be expressed in primal coordinates as

$a = a_0 + a_1\alpha + a_2\alpha^2 + \ldots + a_{m-1}\alpha^{m-1}$    (3.22)

where $\alpha$ is a primitive element in GF($p^m$) and $N_p$ is a primal basis generated by $\alpha$. Assume
the irreducible defining polynomial is

$f(x) = f_0 + f_1 x + f_2 x^2 + \ldots + x^m$.    (3.23)

Since $f(\alpha) = 0$, the relation

$\alpha^m = (-f_0) + (-f_1)\alpha + (-f_2)\alpha^2 + \ldots + (-f_{m-1})\alpha^{m-1}$    (3.24)

is used to reduce a polynomial to one with degree less than $m$. Before the calculation of
the product of elements in GF($p^m$), the multiplication of a field element $a$ by the primitive
element $\alpha$ and its power $\alpha^l$ is derived first. The product of $a$ and $\alpha$ is

$a\alpha = a_0\alpha + a_1\alpha^2 + a_2\alpha^3 + \ldots + a_{m-2}\alpha^{m-1} + a_{m-1}\alpha^m$.    (3.25)

By applying Equation (3.24) and letting $g_i = -f_i$, the product $a\alpha$ becomes

$a\alpha = (-f_0 a_{m-1}) + (a_0 - f_1 a_{m-1})\alpha + (a_1 - f_2 a_{m-1})\alpha^2 + \ldots + (a_{m-2} - f_{m-1} a_{m-1})\alpha^{m-1}$
$\phantom{a\alpha} = (g_0 a_{m-1}) + (a_0 + g_1 a_{m-1})\alpha + (a_1 + g_2 a_{m-1})\alpha^2 + \ldots + (a_{m-2} + g_{m-1} a_{m-1})\alpha^{m-1}$.    (3.26)




This relationship suggests an implementation of the product $a\alpha$ by an $m$-digit shift register
array, the RG$_i$'s, and a feedback network which routes the most significant digit $a_{m-1}$ to the rest
of the digits according to the polynomial reduction mechanism. A generic primal basis
$a\alpha$-multiplier is shown in Figure 3.5. In terms of propagation delay and hardware cost, this
implementation results in a one level time delay of modulo addition and $j$ modulo $p$ multipliers
and adders, where $j$ is the number of nonzero $g_i$.

Figure 3.5 A Generic Primal Basis $a\alpha$-Multiplier
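One clock of the $a\alpha$-multiplier of Equation (3.26) can be modelled in software as a shift with feedback of the most significant digit. The Python sketch below is a behavioural illustration only; the function name `times_alpha`, the list encoding of coordinates, and the two example fields are assumptions, not part of the circuit description.

```python
def times_alpha(a, f, p):
    """One clock of the a*alpha multiplier, Equation (3.26).

    a: primal coordinates [a_0, ..., a_{m-1}] over GF(p).
    f: low-order coefficients [f_0, ..., f_{m-1}] of the monic defining polynomial.
    """
    g = [(-fi) % p for fi in f]      # g_i = -f_i
    msd = a[-1]                      # a_{m-1} is fed back to every stage
    shifted = [0] + a[:-1]           # a_{i-1} moves into stage i
    return [(s + gi * msd) % p for s, gi in zip(shifted, g)]


# GF(2^4) with f(x) = x^4 + x + 1: alpha^3 * alpha = alpha^4 = 1 + alpha.
print(times_alpha([0, 0, 0, 1], [1, 1, 0, 0], 2))

# GF(3^2) with f(x) = x^2 + x + 2: alpha * alpha = 1 + 2*alpha.
print(times_alpha([0, 1], [2, 1], 3))
```

Only the stages with nonzero $g_i$ need a multiplier and an adder, which is where the count $j$ in the text comes from.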

In the case of $1 \le l \le p^m - 1$, the evaluation of $a\alpha^l$ is achieved simply by clocking the
registers continuously for $l$ right shifts, after which the final result stays in the registers.
The product $a\alpha^l$ can, however, be obtained in at most $m-1$ clocks. This is achieved by the
following mechanism.

From the fact that the defining polynomial $f(x)$ is used to reduce a polynomial to
degree less than $m$, $\alpha^l$ can always be represented in a primal basis. Thus $a\alpha^l$ has the
polynomial form

$a\alpha^l = a ( h_0 + h_1\alpha + \ldots + h_{m-2}\alpha^{m-2} + h_{m-1}\alpha^{m-1} )$
$\phantom{a\alpha^l} = a h_0 + a h_1 \alpha + \ldots + a h_{m-2}\alpha^{m-2} + a h_{m-1}\alpha^{m-1}$
$\phantom{a\alpha^l} = h_0 \cdot a + h_1 \cdot a\alpha + \ldots + h_{m-2} \cdot a\alpha^{m-2} + h_{m-1} \cdot a\alpha^{m-1}$.    (3.27)




This equation can be interpreted as saying that $a\alpha^l$ is obtained by summing the cyclic shifted terms
$h_i \cdot a\alpha^i$, where $i = 0, 1, \ldots, m-1$. Therefore, $a\alpha^l$ is obtained in $m-1$ clocks even though $l$ may
take a value up to $p^m - 2$. The first algorithm for implementing the multiplication is listed below:

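Algorithm 3.3 Primal Basis Accumulated Multiplication.

INPUT: $[\, a_0, a_1, \ldots, a_{m-2}, a_{m-1} \,]$ ( Parallel-In )
       $[\, h_0, h_1, \ldots, h_{m-2}, h_{m-1} \,]$ ( Serial-In; $h_0$ first in )

0. Initialization:
   RG$_i$ ← $a_i$, for $i = 0, 1, \ldots, m-1$;
   AC$_i$ ← 0;

1. Clock 1:
   RG$_i$ ← $a\alpha \bmod f(\alpha)$,
   AC$_i$ ← $h_0 \cdot a$.

2. Clock 2:
   RG$_i$ ← $a\alpha^2 \bmod f(\alpha)$,
   AC$_i$ ← $[\, h_1 \cdot a\alpha \,] \bmod f(\alpha)$,
   AC$_i$ = $[\, h_0 \cdot a + h_1 \cdot a\alpha \,] \bmod f(\alpha)$,

j. Clock j:
   RG$_i$ ← $a\alpha^j \bmod f(\alpha)$,
   AC$_i$ ← $[\, h_{j-1} \cdot a\alpha^{j-1} \,] \bmod f(\alpha)$,
   AC$_i$ = $[\, h_0 \cdot a + h_1 \cdot a\alpha + h_2 \cdot a\alpha^2 + \ldots + h_{j-1} \cdot a\alpha^{j-1} \,] \bmod f(\alpha)$,

m-1. Clock m-1:
   RG$_i$ ← $a\alpha^{m-1} \bmod f(\alpha)$,
   AC$_i$ ← $[\, h_{m-2} \cdot a\alpha^{m-2} \,] \bmod f(\alpha)$,
   AC$_i$ = $[\, h_0 \cdot a + h_1 \cdot a\alpha + h_2 \cdot a\alpha^2 + \ldots + h_{m-2} \cdot a\alpha^{m-2} \,] \bmod f(\alpha)$;

m. Clock m:
   RG$_i$ ← $a\alpha^m \bmod f(\alpha)$,
   AC$_i$ ← $[\, h_{m-1} \cdot a\alpha^{m-1} \,] \bmod f(\alpha)$,
   AC$_i$ = $[\, h_0 \cdot a + h_1 \cdot a\alpha + h_2 \cdot a\alpha^2 + \ldots + h_{m-1} \cdot a\alpha^{m-1} \,] \bmod f(\alpha)$.

OUTPUT: Stored in the accumulators AC$_i$'s ( Parallel-Out ).  ♦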
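The clock-by-clock behaviour of Algorithm 3.3 can be mimicked in Python as follows. This is a behavioural sketch under assumed conventions (coordinate lists over GF($p$), hypothetical names `times_alpha` and `accumulated_multiply`), not a description of the hardware of Figure 3.6. At every clock the accumulator adds $h_{j-1}$ times the current register contents, and the register is then advanced by one multiplication by $\alpha$.

```python
def times_alpha(a, f, p):
    """Shift with feedback of the most significant digit, Equation (3.26)."""
    msd = a[-1]
    shifted = [0] + a[:-1]
    return [(s - fi * msd) % p for s, fi in zip(shifted, f)]


def accumulated_multiply(a, h, f, p):
    """Return the primal coordinates of a * (h_0 + h_1 alpha + ...)."""
    m = len(a)
    rg = list(a)                     # step 0: RG_i <- a_i
    ac = [0] * m                     #         AC_i <- 0
    for hj in h:                     # clocks 1..m, h_0 first in
        # AC accumulates h_{j-1} * a alpha^{j-1}, taken from the old register
        ac = [(acc + hj * r) % p for acc, r in zip(ac, rg)]
        rg = times_alpha(rg, f, p)   # RG <- a alpha^j
    return ac


# GF(2^4), f(x) = x^4 + x + 1: alpha^3 * alpha^2 = alpha^5 = alpha + alpha^2.
print(accumulated_multiply([0, 0, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0], 2))
```

The same routine serves as an $ab$-multiplier when the coordinates of $b$ are supplied in place of the $h_i$.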

Thus, a generic primal basis accumulated multiplier based on the algorithm is
developed and shown in Figure 3.6. The components $h_i$ of $\alpha^l$ are sequentially clocked to be
available for processing, and $a$ is simultaneously loaded into the register at the initial clock. The
contents of the register are updated through feedback of the most significant digit (MSD) at
clock $i$, and the components of $a\alpha^l$ are accumulated at clock $i+1$. Iterating this procedure, all
the components of $a\alpha^l$ are simultaneously available at the $m$th clock. The summing operation
is done locally at each component of $a\alpha^l$ by the accumulators, AC$_i$'s; therefore, $k$ copies of the
accumulators exist, where $k$ is the number of nonzero $h_i$. The architecture of this design is
regular, and the advantage of this design is high speed processing capability in the output
stage due to the single level delay.

Figure 3.6 A Primal Basis Accumulated $a\alpha^l$-Multiplier

To compute the product of $a$ and $b$, where $a, b \in$ GF($p^m$),
let $a$ be represented as in Equation (3.22), while $b$ is represented similarly as

$b = b_0 + b_1\alpha + \ldots + b_{m-2}\alpha^{m-2} + b_{m-1}\alpha^{m-1}$,

then

$ab = a b_0 + a b_1 \alpha + \ldots + a b_{m-2}\alpha^{m-2} + a b_{m-1}\alpha^{m-1}$
$\phantom{ab} = b_0 \cdot a + b_1 \cdot a\alpha + \ldots + b_{m-2} \cdot a\alpha^{m-2} + b_{m-1} \cdot a\alpha^{m-1}$.

This is exactly the form of Equation (3.27). Thus, an $ab$-multiplier is realized by applying the
same design as that in Figure 3.6. Another implementation of the $ab$-multiplier is based on the
nested form of polynomial multiplication and is demonstrated in the following algorithm:
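Algorithm 3.4 Primal Basis Nested Multiplication.

INPUT: $[\, a_0, a_1, \ldots, a_{m-2}, a_{m-1} \,]$ ( Parallel-In )
       $[\, b_0, b_1, \ldots, b_{m-2}, b_{m-1} \,]$ ( Serial-In; $b_{m-1}$ first in )

0. Initialization:
   RG$_i$ ← 0, for $i = 0, 1, \ldots, m-1$;

1. Clock 1:
   RG$_i$ ← $b_{m-1} \cdot a$;

2. Clock 2:
   RG$_i$ ← $[\, ( b_{m-1} \cdot a ) \cdot \alpha + b_{m-2} \cdot a \,] \bmod f(\alpha)$;

3. Clock 3:
   RG$_i$ ← $[\, ( b_{m-1} \cdot a \cdot \alpha + b_{m-2} \cdot a ) \cdot \alpha + b_{m-3} \cdot a \,] \bmod f(\alpha)$,

j. Clock j:
   RG$_i$ ← $[\, ( b_{m-1} \cdot a \cdot \alpha^{j-2} + \ldots + b_{m-j+1} \cdot a ) \cdot \alpha + b_{m-j} \cdot a \,] \bmod f(\alpha)$,

m. Clock m:
   RG$_i$ ← $[\, ( b_{m-1} \cdot a \cdot \alpha^{m-2} + \ldots + b_1 \cdot a ) \cdot \alpha + b_0 \cdot a \,] \bmod f(\alpha)$,
        = $[\, b_{m-1} \cdot a \cdot \alpha^{m-1} + \ldots + b_1 \cdot a \cdot \alpha + b_0 \cdot a \,] \bmod f(\alpha)$,
        = $[\, a \cdot ( b_{m-1} \cdot \alpha^{m-1} + \ldots + b_1 \cdot \alpha + b_0 ) \,] \bmod f(\alpha)$,
        = $[\, a \cdot b \,] \bmod f(\alpha)$.

OUTPUT: Stored in the registers RG$_i$'s ( Parallel-Out ).  ♦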
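Algorithm 3.4 is Horner's rule applied to the coordinates of $b$, and a behavioural Python sketch makes this explicit. The names `times_alpha` and `nested_multiply` and the list encoding are assumptions made for the illustration; the sketch models the register contents only, not the adder structure of the circuit.

```python
def times_alpha(a, f, p):
    """Shift with feedback of the most significant digit, Equation (3.26)."""
    msd = a[-1]
    shifted = [0] + a[:-1]
    return [(s - fi * msd) % p for s, fi in zip(shifted, f)]


def nested_multiply(a, b, f, p):
    """Return the primal coordinates of a * b using one register set."""
    rg = [0] * len(a)                          # step 0: RG_i <- 0
    for bj in reversed(b):                     # b_{m-1} is first in
        rg = times_alpha(rg, f, p)             # (partial product) * alpha
        rg = [(r + bj * ai) % p for r, ai in zip(rg, a)]   # ... + b_j * a
    return rg


# GF(2^4), f(x) = x^4 + x + 1: alpha^3 * alpha^2 = alpha^5 = alpha + alpha^2.
print(nested_multiply([0, 0, 0, 1], [0, 0, 1, 0], [1, 1, 0, 0], 2))

# GF(3^2), f(x) = x^2 + x + 2: (1 + 2 alpha)^2 = alpha^4 = 2.
print(nested_multiply([1, 2], [1, 2], [2, 1], 3))
```

Unlike the accumulated form, no separate accumulator is kept; each clock performs a multiplication by $\alpha$ and an addition in sequence.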


A generic primal basis $ab$-multiplier is developed and shown in Figure 3.7. There is only one
set of registers needed to store the intermediate nested product. One of the inputs, $b$, is fed into
the system sequentially, starting with the MSD, $b_{m-1}$. Iterating this procedure, all the
components of $ab$ are simultaneously available at the $m$th clock. The architecture of this
design is simple and requires less hardware. However, the processing clock cycle is longer due to the
double level delay in the three input modulo $p$ adders.

Figure 3.7 A Primal Basis Nested $ab$ Multiplier




3.6  Dual  Basis  Multiplier 

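For $N_p$ a primal basis and $\alpha$ a primitive element in GF($p^m$), there exists a dual basis
with respect to $N_p$. Let $a$ be expressed in the dual coordinates as

$a = a'_0\beta_0 + a'_1\beta_1 + \ldots + a'_{m-1}\beta_{m-1}$.    (3.28)

Since the element $\alpha_j$ of $N_p$ is $\alpha^j$ in polynomial basis, we know from Equation (3.25) that

$a'_j = \mathrm{Tr}(a\alpha_j) = \mathrm{Tr}(a\alpha^j)$.    (3.29)

To evaluate $a\alpha^l$ in the dual coordinate system, we start from the case of $l = 1$:

$a\alpha = (a\alpha)'_0\beta_0 + (a\alpha)'_1\beta_1 + \ldots + (a\alpha)'_{m-1}\beta_{m-1}$.

For $j = 0, 1, \ldots, m-2$, from Equation (3.29), one has

$(a\alpha)'_j = \mathrm{Tr}(a\alpha \cdot \alpha_j) = \mathrm{Tr}(a\alpha^{j+1}) = a'_{j+1}$.    (3.30)

Furthermore, for $j = m-1$,

$(a\alpha)'_{m-1} = \mathrm{Tr}(a\alpha \cdot \alpha_{m-1}) = \mathrm{Tr}(a\alpha^m)$.    (3.31)

By applying Equation (3.29) and the polynomial reduction shown in Equations (3.24)
and (3.26), Equation (3.31) becomes

$(a\alpha)'_{m-1} = \mathrm{Tr}( a \cdot [ (-f_0) + (-f_1)\alpha + (-f_2)\alpha^2 + \ldots + (-f_{m-1})\alpha^{m-1} ] )$
$\phantom{(a\alpha)'_{m-1}} = \mathrm{Tr}(-f_0 a) + \mathrm{Tr}(-f_1 a\alpha) + \mathrm{Tr}(-f_2 a\alpha^2) + \ldots + \mathrm{Tr}(-f_{m-1} a\alpha^{m-1})$
$\phantom{(a\alpha)'_{m-1}} = (-f_0)\mathrm{Tr}(a) + (-f_1)\mathrm{Tr}(a\alpha) + (-f_2)\mathrm{Tr}(a\alpha^2) + \ldots + (-f_{m-1})\mathrm{Tr}(a\alpha^{m-1})$
$\phantom{(a\alpha)'_{m-1}} = g_0 a'_0 + g_1 a'_1 + g_2 a'_2 + \ldots + g_{m-1} a'_{m-1}$.    (3.32)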

The relationships in Equations (3.30) and (3.32) suggest a simple implementation of the
multiplication by a shift register array and a feedback network which is a tree of modulo $p$ adders.
It is obvious from the above equations that the burden of the operation, both in time and cost,
lies heavily on the evaluation of the most significant digit $(a\alpha)'_{m-1}$. In terms of propagation
delay and hardware cost, this implementation results in a $d$ level delay of addition time with
$d = \lceil \log_2 k \rceil$ and $k$ modulo $p$ adders required, where $k$ is the number of nonzero $g_i$. Figure 3.8
shows a generic dual basis $a\alpha$-multiplier where the circuitry of the feedback network
$\mathrm{Tr}(a\alpha^m)$ is defined by $f(x)$ as in Equation (3.32).

Figure 3.8 A Generic Dual Basis $a\alpha$-Multiplier

In the case of $1 \le l \le p^m - 1$, $\alpha^l$ can always be represented in primal basis. Thus, $a\alpha^l$ has
the same polynomial form as that in Equation (3.27). The following observation on the
contents of the shift register of $a\alpha^i$ will be helpful for the development of the dual basis
multipliers. The contents of the registers are represented in vector form to give a better view of their
relationships, where

$a = [\, a'_0 \;\; a'_1 \;\; a'_2 \;\; \cdots \;\; a'_{m-2} \;\; a'_{m-1} \,]$
$a\alpha = [\, a'_1 \;\; a'_2 \;\; a'_3 \;\; \cdots \;\; a'_{m-1} \;\; T'_0 \,]$
$a\alpha^2 = [\, a'_2 \;\; a'_3 \;\; a'_4 \;\; \cdots \;\; T'_0 \;\; T'_1 \,]$
$\vdots$
$a\alpha^{m-1} = [\, a'_{m-1} \;\; T'_0 \;\; T'_1 \;\; \cdots \;\; T'_{m-3} \;\; T'_{m-2} \,]$,

and

$T'_i = (a\alpha^{i+1})'_{m-1}$.

By regarding the piled-up vectors as a matrix, it is seen to be symmetrical. This leads
to the conclusion that the $i$th component $(a\alpha^l)'_i$ of the vector $a\alpha^l$, which is obtained by
summing the $i$th components of the rows of the matrix (each row weighted by the corresponding $h_j$),
can also be obtained by summing the columns of the $i$th row of the matrix. Based upon this
observation, the architectures of the dual basis multipliers are developed as follows.

The first design approach is based on the fact that $a\alpha^l$ can be obtained by summing the rows
of the matrix, using the algorithm developed in Algorithm 3.3. The input $a$ is
simultaneously loaded into the register at the initial clock. The register has a feedback digit and is
updated at clock $i$, and the components of $a\alpha^l$ are accumulated at clock $i+1$. All the components of
$a\alpha^l$ are simultaneously available at the $m$th clock. A generic dual basis accumulated $a\alpha^l$-multiplier
is shown in Figure 3.9. The architecture of this design is regular but needs more
hardware. The advantage of this design is high speed processing capability in the output stage due
to the single level delay, which is independent of the number $j$. Thus, the system clock cycle
depends solely on the feedback network.

Figure 3.9 A Dual Basis Accumulated $a\alpha^l$-Multiplier

The second design algorithm is based on the
observation that $a\alpha^l$ can be obtained by summing the columns of the matrix.

Algorithm 3.5 Dual Basis Summed Multiplication.

INPUT: $[\, a'_0, a'_1, \ldots, a'_{m-2}, a'_{m-1} \,]$ ( Parallel-In )
       $[\, h_0, h_1, \ldots, h_{m-2}, h_{m-1} \,]$ ( Parallel-In )

0. Initialization:
   RG$_i$ ← $a'_i$, for $i = 0, 1, \ldots, m-1$;
   $(a\alpha^l)'_0 = h_0 a'_0 + h_1 a'_1 + \ldots + h_{m-2} a'_{m-2} + h_{m-1} a'_{m-1}$,

1. Clock 1:
   RG$_i$ ← $a\alpha$,
   $(a\alpha^l)'_1 = h_0 a'_1 + h_1 a'_2 + \ldots + h_{m-2} a'_{m-1} + h_{m-1} T'_0$,

2. Clock 2:
   RG$_i$ ← $a\alpha^2$,
   $(a\alpha^l)'_2 = h_0 a'_2 + h_1 a'_3 + \ldots + h_{m-2} T'_0 + h_{m-1} T'_1$,

m-1. Clock m-1:
   RG$_i$ ← $a\alpha^{m-1}$,
   $(a\alpha^l)'_{m-1} = h_0 a'_{m-1} + h_1 T'_0 + \ldots + h_{m-2} T'_{m-3} + h_{m-1} T'_{m-2}$,

OUTPUT: $(a\alpha^l)'_i$ is sequentially available at clock $i$ ( Serial-Out ).  ♦
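Algorithm 3.5 can be checked numerically on a small field. The Python sketch below assumes GF($2^4$) with $f(x) = x^4 + x + 1$, elements stored as bit vectors, and the hypothetical helper names `gf_mul`, `trace`, `dual_coords`, and `dual_summed_multiply`. It takes the dual coordinates of $a$ from Equation (3.29), shifts the register with the feedback of Equation (3.32), and emits one product digit per clock.

```python
M, POLY = 4, 0b10011                 # GF(2^4), f(x) = x^4 + x + 1 (assumed field)
F_LOW = [1, 1, 0, 0]                 # f_0..f_3; over GF(2), g_i = -f_i = f_i


def gf_mul(a, b):
    """Reference multiplication on bit vectors (polynomial basis)."""
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


def trace(a):
    """Tr(a) = a + a^2 + a^4 + a^8, an element of GF(2)."""
    total, conj = 0, a
    for _ in range(M):
        total ^= conj
        conj = gf_mul(conj, conj)
    return total


def dual_coords(a):
    """a'_j = Tr(a * alpha^j), Equation (3.29)."""
    coords, power = [], 1
    for _ in range(M):
        coords.append(trace(gf_mul(a, power)))
        power = gf_mul(power, 0b10)  # alpha = x
    return coords


def dual_summed_multiply(a_dual, h):
    """Serial-out product digits: a in dual coordinates, h in primal coordinates."""
    rg, out = list(a_dual), []
    for _ in range(M):
        out.append(sum(hj * r for hj, r in zip(h, rg)) % 2)     # adder tree ADD
        feedback = sum(g * r for g, r in zip(F_LOW, rg)) % 2    # Equation (3.32)
        rg = rg[1:] + [feedback]                                # Equation (3.30)
    return out


a, b = 0b1011, 0b0110
h = [(b >> j) & 1 for j in range(M)]
print(dual_summed_multiply(dual_coords(a), h), dual_coords(gf_mul(a, b)))
```

The two printed vectors coincide, which reflects that the output is the product expressed in the dual basis while $b$ enters in the primal basis.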

In this design approach, a generic dual basis multiplier is depicted in Figure 3.10. The
components $a'_i$ of $a$ are simultaneously loaded into the register at the first clock; then the
components of $a\alpha^l$ are obtained sequentially. For instance, at clock $i$ the $i$th component of $a\alpha^l$ is
obtained by summing the current contents of the register with a $d$ level adder tree (ADD).
The level $d$ in this case is equal to $\lceil \log_2 j \rceil$, where $j$ is the number of nonzero $h_i$, and the number of
adders required to implement ADD is therefore $j-1$. Combined with the delay factor in the
feedback network, the system clock cycle is decided by the greater delay level of the feedback
network and ADD. This architecture is simple and requires less hardware. It is suitable for sequential
digit applications. However, the $d$ level delay will significantly slow the system clock as $j$
becomes large.

Figure 3.10 A Dual Basis Summed $a\alpha^l$-Multiplier

Both $a\alpha^l$ multipliers discussed above can be used to calculate the product of $a$ and $b$,
where $a, b \in$ GF($p^m$). Here $a$ is represented in dual basis as shown in Equation (3.28), while $b$ is
represented in primal basis as

$b = h_0 + h_1\alpha + \ldots + h_{m-2}\alpha^{m-2} + h_{m-1}\alpha^{m-1}$,

then

$ab = a h_0 + a h_1 \alpha + \ldots + a h_{m-2}\alpha^{m-2} + a h_{m-1}\alpha^{m-1}$.

This is exactly the form of Equation (3.27). Thus, an $ab$-multiplier can be realized by applying
the same design as in Figure 3.9 or 3.10, in which $a$ is represented in dual basis, $b$ is in primal basis,
and the product is represented in dual basis form. Another implementation of the $ab$-multiplier is
similar to the primal basis nested multiplication algorithm in Algorithm 3.4. A generic dual
basis $ab$ multiplier is also developed and shown in Figure 3.11. The architecture of this
design is simpler and requires less hardware than those shown in Figures 3.9 and 3.10. Besides, the
processing clock cycle depends entirely on the feedback network.

Figure 3.11 A Dual Basis Nested $ab$ Multiplier

It is unfortunate that two different bases are involved in the system, since to actually use
it as part of a larger device it would in general be necessary to have circuitry to change basis.
However, some bases have self-dual properties, under which
the change of basis is nothing but a permutation of coordinates [McE87].

3.7  Normal  Basis  Multiplier 

Massey and Omura [Wan85] invented a multiplier which performs the product of two
elements in the finite field GF($2^m$). In the normal basis representation the exponentiation by
powers of $p$ of an element in GF($p^m$) is readily shown to be a simple cyclic shift of its digits.
Multiplication in the normal basis representation requires for any one product digit the same
logic circuitry as it does for any other product digit.

It is well known [Mac77] that a normal basis $N_n$ exists in any finite field GF($p^m$). Let
$\beta \in$ GF($p^m$), then one can find a normal basis $N_n$ such that $\beta$ is uniquely expressed as

$\beta = b_0\alpha + b_1\alpha^p + b_2\alpha^{p^2} + \ldots + b_{m-1}\alpha^{p^{m-1}}$,

where $b_i \in$ GF($p$), $i = 0, 1, \ldots, m-1$. According to the binomial theorem and the Fermat
theorem that

$\alpha^{p^m - 1} = 1$ in GF($p^m$)

and

$b_i^{p-1} = 1 \bmod p$ (for nonzero $b_i$),

the $p$th power of $\beta$ becomes

$\beta^p = b_0^p\alpha^p + b_1^p\alpha^{p^2} + b_2^p\alpha^{p^3} + \ldots + b_{m-1}^p\alpha^{p^m}$
$\phantom{\beta^p} = b_{m-1}\alpha + b_0\alpha^p + b_1\alpha^{p^2} + \ldots + b_{m-2}\alpha^{p^{m-1}}$.

Hence, raising elements in GF($p^m$) to powers of $p$ can simply be realized by logic circuitry
which accomplishes a cyclic shift in a register. A cyclic shift register for power forming in
finite fields is shown in Figure 3.12.

Figure 3.12 A Cyclic Shift Register for Power Forming in GF($p^m$)

Let the vectors $\beta = [\, b_0, b_1, \ldots, b_{m-1} \,]$ and $\gamma = [\, c_0, c_1, \ldots, c_{m-1} \,]$ be two
elements of GF($p^m$) in a normal basis representation. Hence, the last term $d_{m-1}$ of the
product

$\delta = \beta\gamma = [\, d_0, d_1, \ldots, d_{m-1} \,]$

is some function of the components of $\beta$ and $\gamma$, which can be defined as the $F_{NB}$ function.
Therefore,

$d_{m-1} = F_{NB}( b_0, b_1, \ldots, b_{m-1} ;\; c_0, c_1, \ldots, c_{m-1} )$.    (3.33)

Since exponentiation by powers of $p$ means a cyclic shift of an element in a normal basis
representation, one has

$\delta^p = \beta^p \gamma^p$
$\phantom{\delta^p} = [\, b_{m-1}, b_0, \ldots, b_{m-2} \,] \, [\, c_{m-1}, c_0, \ldots, c_{m-2} \,]$
$\phantom{\delta^p} = [\, d_{m-1}, d_0, \ldots, d_{m-2} \,]$.

Thus, the last component $d_{m-2}$ of $\delta^p$ is obtained by the same $F_{NB}$ function in Equation
(3.33) operating on the components of $\beta^p$ and $\gamma^p$, that is

$d_{m-2} = F_{NB}( b_{m-1}, b_0, \ldots, b_{m-2} ;\; c_{m-1}, c_0, \ldots, c_{m-2} )$.

By repeating the above procedure, the product operation simply requires one logic function $F_{NB}$
of the $2m$ components of $\beta$ and $\gamma$ to sequentially compute the $m$ components of the product.
Figure 3.13 illustrates the general logic diagram of the normal basis multiplier, where the gate
array is a VLSI implementation of the $F_{NB}$ function.

Figure 3.13 A Normal Basis Multiplier
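The shift-and-evaluate principle of the normal basis multiplier can be demonstrated on GF($2^4$). The Python sketch below assumes $f(x) = x^4 + x + 1$ and the normal basis generated by $\alpha^3$; the names `gf_mul`, `from_normal`, `TO_NORMAL`, `f_nb`, and `normal_multiply` are hypothetical. For brevity $F_{NB}$ is evaluated by table lookup rather than as the fixed Boolean function a gate array would implement.

```python
from itertools import product

M, POLY = 4, 0b10011                 # GF(2^4), f(x) = x^4 + x + 1 (assumed field)


def gf_mul(a, b):
    """Reference multiplication on bit vectors (polynomial basis)."""
    result = 0
    for _ in range(M):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return result


# Assumed normal basis: alpha^3 and its successive squares.
BASIS = [0b1000]
for _ in range(M - 1):
    BASIS.append(gf_mul(BASIS[-1], BASIS[-1]))


def from_normal(coords):
    """Field element with the given normal basis coordinates."""
    elem = 0
    for bit, n in zip(coords, BASIS):
        if bit:
            elem ^= n
    return elem


TO_NORMAL = {from_normal(c): c for c in product((0, 1), repeat=M)}


def f_nb(b, c):
    """Last product coordinate d_{m-1}, Equation (3.33), by table lookup."""
    return TO_NORMAL[gf_mul(from_normal(b), from_normal(c))][-1]


def normal_multiply(b, c):
    """Compute all product digits with the single function f_nb and cyclic shifts."""
    d = [0] * M
    for k in range(M):
        d[M - 1 - k] = f_nb(b, c)    # after k shifts f_nb yields d_{m-1-k}
        b = b[-1:] + b[:-1]          # cyclic shift = raising to the power p = 2
        c = c[-1:] + c[:-1]
    return tuple(d)


print(normal_multiply((1, 0, 0, 0), (0, 1, 0, 0)))
```

The sketch relies on the chosen elements being linearly independent, which is reflected in the lookup table containing all sixteen field elements.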




CHAPTER  4 

FINITE  STRUCTURE  TRANSFORMS 


4.1  Introduction 

Finite digital convolution is a numerical procedure which has many very powerful
applications. It is used to implement finite impulse response (FIR) and infinite impulse
response (IIR) digital filters; to carry out auto and cross correlation; as well as to perform
computations such as polynomial multiplication and multiplication of very large integers
[Aga75]. There are several methods to implement finite convolution that differ in the amount
of computation required, the effects of arithmetic round-off, and the amount of storage
required. It is somewhat difficult to compare various algorithms because of the trade-offs in
hardware and software implementations. However, because of the complexity of performing
multiplication, the number of multiplications necessary to implement convolution is often an
important factor to be minimized.

The use of the cyclic convolution property (CCP) of discrete Fourier transforms
(DFT's) can reduce the computational complexity of finite convolution only when fast
Fourier transform (FFT) algorithms are available and applied to the DFT's. However, the
major disadvantage of this approach is the significant amount of round-off error
that accumulates when the transform is performed in the
complex number field, C.

Recently, various researchers have proposed the use of transforms over finite structures
using number theoretic concepts for error-free, fast, efficient computation of finite digital
convolutions of real integer sequences or complex integer sequences. The implementation of
these transforms involves the use of modular arithmetic (as discussed in Chapter 2) and,
depending on the choice of the modulus, M, a large class of transforms exists that have the CCP.
By special choices of the length of sequence, N, the modulus M, and the value of the
transform factor, r, it is possible to have transforms that need only word shifts and additions but no
multiplications, that have an FFT-type fast algorithm, that do not require storage of a complex
value r, and that have no round-off errors. These transforms are called number theoretic
transforms (NTT's). In particular, the transform with a modulus of the form of the $t$th Fermat number,
$F_t$, referred to as the Fermat number transform (FNT), has been described in detail by
Agarwal and Burrus [Aga74a], [Aga75]. A Mersenne number transform (MNT) with modulus
$M = 2^p - 1$, $p$ prime, has been defined by Rader [Rad72]. The main disadvantage of both the
FNT and MNT is the requirement of a rigid relationship between the dynamic range and the
obtainable transform length. There is also a limited choice of possible wordlengths.

Transforms in finite structures are also applicable over extension fields. During the past
few years there has been strong interest among researchers [Ree75] in transforms over the
second order extension field, GF($p^2$), which are analogous to those used for the complex field
C. The use of these transforms offers a number of advantages. In many applications, such as in
radar or communication, the sequences to be convolved consist of complex quantities; the
transform also allows greatly increased sample lengths over those defined in the ground
field, GF($p$), like those of the FNT or MNT; for a sufficiently large $p$, one can use this transform
to convert a sequence of complex integers $a_n$ into the sequence $A_k$ in GF($p^2$) for which the
inverse transform of $A_k$ is precisely the original sequence of complex numbers $a_n$.
Consequently, filtering operations or convolutions without round-off error are obtained using this
transform on a sequence of complex integers.

For the higher-order extension field, $GF(p^m)$, transforms exist as long as the necessary
and sufficient conditions are satisfied. By applying finite field properties, such as the conjugacy
property and the basis representation of the field elements, to the existing fast transform
algorithms, a significant reduction in computational complexity is achieved. There also has been
considerable work on high-speed residue number arithmetic for use in high data rate digital
signal processing. The structure of a complex integer ring is generalized to form a complex
residue number system (CRNS), which is a formidable computational number system, independent
of any relation it may have to the NTTs. In this CRNS, complex multiplication is
accomplished by means of a real index calculus (discrete logarithm) because the selection of
the system parameters is not overly constrained by the algebraic structure required by an NTT.
The CRNS is only useful when the complex variables are feasibly represented by the power
form over a second-order extension field and computed using the discrete logarithm methods
[Jen80a]. When complex variables are represented by an isomorphic mapping, traditionally
referred to as the quadratic residue number system (QRNS), the usual operations are reduced
to point-wise operations.

4.2 Properties of Transforms and Some Prime Field Transforms


Define

$$O(M) = \gcd(p_1 - 1,\ p_2 - 1,\ \ldots,\ p_l - 1), \tag{4.1}$$

where $p_1, p_2, \ldots, p_l$ are the prime factors of M. Note that
$N \mid \gcd(p_1 - 1, p_2 - 1, \ldots, p_l - 1)$, so that a necessary and sufficient condition
for the existence of an N-point NTT is that

$$N \mid O(M). \tag{4.2}$$

In practice it is often easier to verify the following three necessary and sufficient conditions
for the existence of an N-point NTT defined modulo a composite number M:

1) $\alpha^N = 1 \bmod M$,

2) $N N^{-1} = 1 \bmod M$,

3) $\gcd(\alpha^l - 1, M) = 1$ for all l such that N/l is a prime number.

As an example, let $M = p^2 = 3^2$ and $\alpha = 8$. Then 8 is a root of order 2 modulo 9, N = 2,
$\gcd(M, N) = 1$, $N^{-1} = 5 \bmod 9$, and $N \mid (p - 1)$.
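
As an illustration only, the three conditions can be checked numerically. The following is a minimal Python sketch; the function names `prime_factors` and `ntt_exists` are assumptions introduced here and do not appear in the original development.

```python
from math import gcd

def prime_factors(n):
    """Return the distinct prime factors of n by trial division."""
    factors, d = [], 2
    while d * d <= n:
        if n % d == 0:
            factors.append(d)
            while n % d == 0:
                n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def ntt_exists(alpha, N, M):
    """Check the three conditions for an N-point NTT modulo M with root alpha."""
    if pow(alpha, N, M) != 1:            # condition 1: alpha^N = 1 mod M
        return False
    if gcd(N, M) != 1:                   # condition 2: N has an inverse mod M
        return False
    for q in prime_factors(N):           # condition 3: every l with N/l prime
        l = N // q
        if gcd(pow(alpha, l, M) - 1, M) != 1:
            return False
    return True

print(ntt_exists(8, 2, 9))   # the example in the text: M = 9, alpha = 8, N = 2
```

For the example above the check succeeds, whereas a choice such as $\alpha = 4$, N = 2, M = 15 fails condition 3 because $\gcd(4 - 1, 15) = 3$.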




Cyclic Convolution Property (CCP). If g(n) and h(n), n = 0, 1, 2, ..., N-1, are two
periodic sequences with period N, their cyclic convolution is a periodic sequence a(i), i = 0, 1,
2, ..., N-1, with period N described by

$$a(i) = \sum_{n=0}^{N-1} h(i - n)\, g(n). \tag{4.3}$$

If the discrete transforms T of the sequences g(n), h(n), and a(n) can be related as

$$T[a(n)] = T[h(n)]\, T[g(n)], \tag{4.4}$$

then the transform has the cyclic convolution property (CCP). Hence the CCP states that the
transform of the cyclic convolution of two sequences is the product of the transforms of the
two sequences. Certainly, the DFT has this property, so that

$$a(n) = \mathrm{IDFT}\{\, \mathrm{DFT}[h(n)]\ \mathrm{DFT}[g(n)] \,\}. \tag{4.5}$$


DFT Structure. If an N-point sequence x(n) and its transform X(k) can be related by a
transform pair

$$X(k) = \sum_{n=0}^{N-1} x(n)\, \alpha^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{4.6}$$

and

$$x(n) = N^{-1} \sum_{k=0}^{N-1} X(k)\, \alpha^{-nk}, \quad n = 0, 1, \ldots, N-1, \tag{4.7}$$

then the transform whose basis functions are $\alpha^{nk}$ is said to have a DFT structure. In this case
both the forward and inverse transforms have similar operations. In Equation (4.7), $N^{-1}$
represents the multiplicative inverse of N in the field in which the arithmetic is carried out. An
N-point transform having the DFT structure has the CCP, provided $N^{-1}$ exists and $\alpha$ is a
primitive root of order N. When all the transform operations are carried out in a field of
integers modulo M, the transform belongs to the NTT. In a ring of integers $Z_M$, as $-k = M - k \bmod M$,
conventional integers can be uniquely represented only if their absolute value is less than
M/2. Since the convolution is implemented in modular arithmetic, so long as the magnitude
of the convolution of two sequences does not exceed M/2, the NTT can yield the same result
as that obtained using conventional arithmetic.
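
The transform pair (4.6)-(4.7) and the CCP can be illustrated with a small numerical sketch. The parameters M = 17, $\alpha = 4$, N = 4 and all function names below are assumptions chosen for illustration; the direct O(N^2) summation is used rather than a fast algorithm, and Python 3.8 or later is assumed for the modular inverse `pow(x, -1, M)`.

```python
def ntt(x, alpha, M):
    """Forward transform X(k) = sum_n x(n) alpha^(nk) mod M, as in Equation (4.6)."""
    N = len(x)
    return [sum(x[n] * pow(alpha, n * k, M) for n in range(N)) % M
            for k in range(N)]

def intt(X, alpha, M):
    """Inverse transform of Equation (4.7), with N^-1 and alpha^-1 taken mod M."""
    N = len(X)
    n_inv = pow(N, -1, M)
    alpha_inv = pow(alpha, -1, M)
    return [n_inv * sum(X[k] * pow(alpha_inv, n * k, M) for k in range(N)) % M
            for n in range(N)]

def cyclic_convolution(h, g):
    """Direct evaluation of Equation (4.3) in ordinary integer arithmetic."""
    N = len(h)
    return [sum(h[(i - n) % N] * g[n] for n in range(N)) for i in range(N)]

def centered(v, M):
    """Map a residue in [0, M) to the signed range of magnitude below M/2."""
    return v - M if v > M // 2 else v

M, alpha = 17, 4              # 4 has order 4 modulo 17
h = [1, 2, 0, -1]
g = [1, -1, 1, 0]
H, G = ntt(h, alpha, M), ntt(g, alpha, M)
product = [(a * b) % M for a, b in zip(H, G)]
a_ntt = [centered(v, M) for v in intt(product, alpha, M)]
a_direct = cyclic_convolution(h, g)
print(a_ntt, a_direct)
```

Because every output sample of this example has magnitude below M/2, the signed values recovered from the residues agree with the conventional result.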

Although there is a large class of NTTs that can implement cyclic convolution, only a
few of them are computationally efficient when compared to the DFT and other techniques.
Three constraints dictate the selection of an NTT for discrete convolution:

1) N should be highly composite so that the NTT may have a fast algorithm, and it should be
large enough for application to long sequence lengths;

2) Multiplication by powers of $\alpha$ should be a simple operation. If $\alpha$ and its powers have a
simple binary representation, then this multiplication reduces to bit shifting;

3) To simplify modular arithmetic, M should have property 2) and should be large enough to
prevent overflow;

4) Another constraint on the NTT is that the word length of the arithmetic is related to the
maximum length of the sequence. For example, for the FNT, when $\alpha = \sqrt{2}$, $N = 2^{t+2} = 4b$, that is, 4
times the word length b. When $\alpha = 2$, $N = 2^{t+1} = 2b$, that is, 2 times the word length. This constraint
can, however, be relaxed by adopting multidimensional techniques for implementing
one-dimensional convolution [Aga74a].

Selection of the modulus M, sequence length N, and the order of $\alpha$ modulo M is based
on meeting the above constraints so that an efficient NTT can be developed. For example, if
M is even, then by [Aga74b] the maximum possible sequence length is 1, a case of no interest.
When M is a prime number, $N_{max} = M - 1$. Finally, when $M = 2^k - 1$ and k is a composite
number k = PQ, where P is a prime number and Q is not necessarily a prime number, then

$$(2^P - 1) \mid (2^{PQ} - 1), \tag{4.8}$$

and $N_{max} = 2^P - 1$.




Fermat Number Transform (FNT). If $M = 2^k + 1$ and k is odd, then $3 \mid (2^k + 1)$. Hence
$N_{max} = 2$. When k is even and $k = s2^t$, where s is an odd integer and t is an integer,

$$(2^{2^t} + 1) \mid (2^{s2^t} + 1), \tag{4.9}$$

and the sequence length is governed by $2^{2^t} + 1$. For integers of the form $M = 2^{2^t} + 1$, called
Fermat numbers, the NTT reduces to the Fermat number transform (FNT). $F_t$ is the t-th Fermat
number, defined as

$$F_t = M = 2^{2^t} + 1. \tag{4.10}$$

Of all the Fermat numbers, only $F_0$ to $F_4$ are known to be prime. The FNT and its inverse can be defined as

$$X_f(k) = \Big[ \sum_{n=0}^{N-1} x(n)\, \alpha^{nk} \Big] \bmod F_t, \quad k = 0, 1, \ldots, N-1, \tag{4.11}$$

$$x(n) = \Big[ N^{-1} \sum_{k=0}^{N-1} X_f(k)\, \alpha^{-nk} \Big] \bmod F_t, \quad n = 0, 1, \ldots, N-1, \tag{4.12}$$

where N is the order of $\alpha$ modulo $F_t$, such that $\alpha^N = 1 \bmod F_t$.
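
To make the shift-only character of the FNT concrete, the following minimal sketch evaluates Equations (4.11) and (4.12) for $\alpha = 2$, replacing each multiplication by a power of 2 with a left shift. The function names are assumptions introduced here, and the final `% F` is a software stand-in for the modular reduction that dedicated hardware would perform differently.

```python
def shift_mul(x, s, t):
    """Multiply x by 2**s modulo F_t = 2**(2**t) + 1 using a shift and one reduction."""
    F = (1 << (1 << t)) + 1
    N = 2 << t                       # order of 2 modulo F_t is 2**(t+1)
    return (x << (s % N)) % F        # a negative s wraps around because 2**N = 1

def fnt(x, t):
    """FNT with alpha = 2 for a sequence of length N = 2**(t+1), Equation (4.11)."""
    F = (1 << (1 << t)) + 1
    N = len(x)
    return [sum(shift_mul(x[n], n * k, t) for n in range(N)) % F
            for k in range(N)]

def ifnt(X, t):
    """Inverse FNT, Equation (4.12); alpha**(-nk) is realized as a wrapped shift."""
    F = (1 << (1 << t)) + 1
    N = len(X)
    n_inv = pow(N, -1, F)
    return [n_inv * sum(shift_mul(X[k], -n * k, t) for k in range(N)) % F
            for n in range(N)]

x = [1, 2, 3, 4, 5, 6, 7, 8]         # t = 2: F_2 = 17, N = 8
X = fnt(x, 2)
print(X, ifnt(X, 2))
```

With t = 2 the modulus is $F_2 = 17$ and the transform length is N = 8, so the inverse recovers any input whose samples lie in the range 0 to 16.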

Mersenne Number Transform (MNT). Mersenne numbers are the integers given by
$2^P - 1$, where P is prime. When M is a Mersenne number the NTT is called the Mersenne number
transform (MNT). When $\alpha = -2$, $N = N_{max} = 2P$. When $\alpha = 2$, N = P, since $2^P = M + 1 = 1 \bmod M$.
Mersenne numbers, denoted here $M_p$, are 1, 3, 7, 31, 127, 2047, 8191, .... For $\alpha = 2$
the MNT and its inverse can be defined respectively as

$$X_m(k) = \Big[ \sum_{n=0}^{P-1} x(n)\, 2^{nk} \Big] \bmod M_p, \quad k = 0, 1, \ldots, P-1, \tag{4.13}$$

$$x(n) = \Big[ P^{-1} \sum_{k=0}^{P-1} X_m(k)\, 2^{-nk} \Big] \bmod M_p, \quad n = 0, 1, \ldots, P-1, \tag{4.14}$$

where $P^{-1} = M_p - (M_p - 1)/P$.
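
As a small check of Equations (4.13) and (4.14), the sketch below runs the MNT pair for P = 5 ($M_p = 31$) and uses the closed form for $P^{-1}$ given above. The function names and the sample data are assumptions made for illustration.

```python
def mnt(x, P):
    """MNT with alpha = 2 and length P, Equation (4.13); powers of 2 are shifts."""
    M = (1 << P) - 1
    return [sum(x[n] << ((n * k) % P) for n in range(P)) % M for k in range(P)]

def imnt(X, P):
    """Inverse MNT, Equation (4.14), with P^-1 = M_p - (M_p - 1)/P."""
    M = (1 << P) - 1
    p_inv = M - (M - 1) // P
    # 2**(-nk) equals 2**((-nk) mod P) because 2**P = 1 mod M_p
    return [p_inv * sum(X[k] << ((-n * k) % P) for k in range(P)) % M
            for n in range(P)]

P = 5
x = [3, 1, 4, 1, 5]
X = mnt(x, P)
print(X, imnt(X, P))
```

For P = 5 the formula gives $P^{-1} = 31 - 6 = 25$, and indeed $5 \cdot 25 = 125 = 1 \bmod 31$.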




Rader [Rad72] has shown that the MNT satisfies the CCP and has discussed hardware
implementation for the MNT.

Rader Transform (RT). The Rader transform is a special case of the NTT. For any Fermat
number, 2 is of order $N = 2b = 2^{t+1}$; that is, $2^{2b} = 1 \bmod F_t$. When $\alpha$ is any power of 2, all
the multiplications by $\alpha^{nk}$ become bit shifts and the FNT can be computed very efficiently;
for the case $\alpha = 2$, both the FNT and MNT are called Rader transforms. When N is an
integer power of 2, the Rader transform can be implemented by a radix-2 FFT-type algorithm.
Substituting 2 for the multiplier $w = \exp(-j2\pi/N)$ in the FFT flowgraph yields the fast
algorithm for the Rader transform.

4.3 Complex Fields Transforms and Complex Arithmetic

A multitude of important signal processing and communication applications, such as
the computation of the FFT and the auto-correlation or cross-correlation of multiphase
communication signals, require complex transforms and a large number of complex multiplications.
Thus, the NTT in a complex field and a low-complexity complex number multiplication
scheme are introduced in the following sections.

Digital filtering of complex signals or cyclic convolution of complex sequences can be
accomplished in a complex integer field. In a ring of complex integers, $Z^c_M$, all the arithmetic
operations are performed as in normal complex arithmetic except that both the real and
imaginary parts are evaluated separately mod M. The set $c_j = a_j + i b_j$, $a_j, b_j = 0, 1, \ldots, M-1$,
where $a_j = \mathrm{Re}[c_j]$ and $b_j = \mathrm{Im}[c_j]$, represents $Z^c_M$. All complex integers are congruent
modulo M to some complex integer in this set. Complex convolutions arise in many fields,
such as radar, sonar, and modem equalizers.




4.3.1 Complex Number Theoretic Transform (CNTT)

There are two distinct situations, which are handled quite differently, depending on
whether or not the square root of -1, denoted by $\sqrt{-1}$, exists in $GF(p)$. If p is a Mersenne
prime, then $\sqrt{-1}$ does not exist in $GF(p)$; if p is a Fermat prime, then $\sqrt{-1}$ does exist in
$GF(p)$.

Complex MNT. For $p = 2^m - 1$, a Mersenne prime $M_p$, where m is an odd prime, $\sqrt{-1}$
does not exist in the field $GF(M_p)$. Thus, we extend the field to $GF(M_p^2)$ in the same way that
the real field R is extended to the complex number field C. The Galois field Fourier transform
over $GF(M_p^2)$ can be used to compute convolutions in the complex field.

In the Galois field $GF(M_p)$, the polynomial $x^2 + 1$ has no zeros. Hence, the field is
extended by adjoining an element called j and forming the set $GF(M_p^2) = GF(M_p)(j) = \{ a + jb \}$,
where a and b are elements of $GF(M_p)$.

As shown in Equations (4.13) and (4.14), the MNT transform pair can be utilized in the
complex number case except that the input sequence length d is $M_p^2 - 1$ or a divisor thereof;
the transform factor $\alpha$ is an element of $GF(M_p^2)$ of order d. Looking in more detail at the
length d, we have the factorization

$$M_p^2 - 1 = (2^m - 1)^2 - 1 = 2^{m+1}(2^{m-1} - 1).$$

Hence, in $GF(M_p^2)$ we can choose $2^{m+1}$ or some factors of $2^{m-1} - 1$ as the data length of the
transform. In the case where d is a power of two, the transform can be computed by a radix-two
Cooley-Tukey FFT. This transform has more choices for the data length than the MNT within
the finite field $GF(M_p)$.
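
The available transform lengths can be explored numerically for the smallest case, m = 3 ($M_p = 7$). The sketch below is illustrative only: the helper names `cmul` and `order` are assumptions, and elements of $GF(M_p^2)$ are stored as pairs (a, b) standing for a + jb.

```python
def cmul(u, v, p):
    """Multiply a + jb by c + jd with j*j = -1, coefficients reduced mod p."""
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % p, (a * d + b * c) % p)

def order(u, p):
    """Multiplicative order of a nonzero element u of GF(p^2)."""
    k, w = 1, u
    while w != (1, 0):
        w = cmul(w, u, p)
        k += 1
    return k

m = 3
p = 2**m - 1                                   # Mersenne prime 7
assert p * p - 1 == 2**(m + 1) * (2**(m - 1) - 1)

# Orders of all nonzero elements: each is a possible transform length d.
orders = {order((a, b), p)
          for a in range(p) for b in range(p) if (a, b) != (0, 0)}
print(sorted(orders))
```

In this case the power-of-two length $2^{m+1} = 16$ is available, whereas the MNT over the ground field GF(7) with $\alpha = 2$ is limited to length 3.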




Complex FNT. In the case of p a Fermat prime, $F_t = 2^m + 1$, it is not possible to form an
extension field $GF(p^2)$ with a multiplication rule that behaves like complex multiplication since
$\sqrt{-1}$ is an element of $GF(p)$. Specifically, $\sqrt{-1} = 2^{m/2}$, since $(2^{m/2})^2 = 2^m = -1 \bmod F_t$
(and $\sqrt{2} = 2^{m/4}(2^{m/2} - 1)$).

4.3.2 Complex Residue Number System (CRNS)

The execution of very high speed complex arithmetic is important in spectral analysis
and in the processing of complex baseband waveforms that result from quadrature
demodulation in radar and communication systems. For example, modern synthetic aperture radars
operating in the spotlight mode require polar-to-rectangular format interpolation filters that
can operate in real time as an aircraft flies by the scene. These interpolation filters are ideally
spatially varying complex FIR filters of relatively low order. It appears that this type of
requirement is well suited for a complex RNS realization.

If the field $GF(p)$ does not contain a square root of -1, then $GF(p)$ can be extended to
$GF(p^2)$ in the same way that the real field is extended to the complex number field. For
example, $6 = -1 \bmod 7$ in $GF(7)$, and it is easy to check that there is no element in $GF(7)$ whose
square is 6. Hence, we define

$$GF(7^2) = GF(49) = \{ a + jb : a, b \in GF(7) \}$$

with addition and multiplication defined in the same way as for the complex number field:

$$[(a + jb) + (c + jd)] \bmod p = [(a + c) + j(b + d)] \bmod p, \tag{4.15}$$

$$[(a + jb) \cdot (c + jd)] \bmod p = [(ac - bd) + j(ad + bc)] \bmod p. \tag{4.16}$$

With these definitions, $GF(p^2)$ is a field. The integers of $GF(p^2)$ are the elements of $GF(p)$,
so $GF(p^2)$ has characteristic p.
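
The role of the missing square root of -1 can be checked by brute force. The following sketch (function names are assumptions for illustration) compares p = 7, where Equation (4.16) yields a field, with p = 5, where $2^2 = -1 \bmod 5$ and the same multiplication rule produces zero divisors.

```python
def has_sqrt_minus_one(p):
    """True if some x in GF(p) satisfies x*x = -1 mod p."""
    return any((x * x + 1) % p == 0 for x in range(p))

def cmul(u, v, p):
    """Multiplication rule of Equation (4.16) on pairs (a, b) = a + jb."""
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % p, (a * d + b * c) % p)

def all_nonzero_invertible(p):
    """True if every nonzero pair has a multiplicative inverse under cmul."""
    elems = [(a, b) for a in range(p) for b in range(p) if (a, b) != (0, 0)]
    return all(any(cmul(u, v, p) == (1, 0) for v in elems) for u in elems)

for p in (7, 5):
    print(p, has_sqrt_minus_one(p), all_nonzero_invertible(p))

# For p = 5, (2 + j)(2 - j) = 5 = 0 mod 5, so 2 + j is a zero divisor.
print(cmul((2, 1), (2, 4), 5))
```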

4.3.3 Quadratic Residue Number System (QRNS)

In a conventional residue number system, such as the CRNS, a complex multiplication
requires four real multiplications and two real additions per modulus. The quadratic residue
number system (QRNS) changes this requirement significantly. This system is defined with
an isomorphic mapping originally suggested in the literature by Vanwormhoudt [Van78] and
later on by Leung [Leu81] and Krogmeier and Jenkins [Kro83].

In a prime field $GF(p)$, those elements that have a square root are called quadratic
residues (because they are the squares of their square roots modulo p). Exactly half of the
nonzero elements in $GF(p)$, p an odd prime, have square roots. To see this (refer to Table 2.3),
first note that every even power of a primitive element $\alpha$ has a square root. On the other hand,
every element that is a square root can be written as $\alpha^n$ for some n, and so its square is
$\alpha^{2n \bmod (p-1)}$, since the multiplicative group of the field is cyclic with p - 1 elements. But p - 1 is
even, so $(2n) \bmod (p - 1)$ is even as well. Hence only even powers of $\alpha$ can have square
roots. For the finite field $GF(17)$ in Table 2.3, there are 16 nonzero elements. Eight of these
elements are quadratic residues, which have square roots. They are the even powers of the
primitive element 3, which are 9, 13, 15, 16, 8, 4, 2, 1. For the application in complex fields,
we concentrate on the case of the quadratic residue -1, or p - 1.

Consider a modulus channel with modulus M and a complex integer z = x + iy of $Z^c_M$,
where $x, y \in Z_M$ and $i = \sqrt{-1}$. Then the quadratic RNS mapping is an isomorphic mapping $F_2$
of $Z^c_M$ onto the external direct product $Z_M \times Z_M = Z^2_M$. This isomorphic mapping satisfies
the following relation:

$$z = x + iy \ \overset{F_2}{\longleftrightarrow}\ (Z_0, Z_1). \tag{4.17}$$

The input and output parameters x, y and $Z_0$, $Z_1$ satisfy

$$Z_0 = (x + jy) \bmod M, \qquad Z_1 = (x - jy) \bmod M; \tag{4.18}$$

$$x = (2^{-1}(Z_0 + Z_1)) \bmod M, \qquad y = (2^{-1} j^{-1}(Z_0 - Z_1)) \bmod M, \tag{4.19}$$

where x, y and $Z_0, Z_1 \in Z_M$, j is a quadratic root of -1 in $Z_M$, and $2^{-1}$ and $j^{-1}$ are the
multiplicative inverses of 2 and j mod M, respectively. The structure $Z^2_M$ is called the QRNS and is a
finite ring consisting of $M^2$ elements.

The rules of composition in $Z^2_M$ are as follows.

1) Addition:

$$(Z_{00}, Z_{01}) + (Z_{10}, Z_{11}) = [(Z_{00} + Z_{10}) \bmod M,\ (Z_{01} + Z_{11}) \bmod M]. \tag{4.20}$$

2) Multiplication:

$$(Z_{00}, Z_{01})(Z_{10}, Z_{11}) = [(Z_{00} \cdot Z_{10}) \bmod M,\ (Z_{01} \cdot Z_{11}) \bmod M]. \tag{4.21}$$

From these rules, we see that the multiplication in the new domain is simply a
component-wise operation.

Again, for the QRNS mapping $F_2$ to exist, the quadratic congruence, $x^2 = -1 \bmod M$,
must be solvable. The necessary and sufficient condition for the quadratic roots of -1 to exist in $Z_M$,
for a prime modulus M, is that M = 4k + 1, where k is a positive integer. Furthermore, the two roots, j and j', are mutually
additive and multiplicative inverses mod M, that is, $j' = -j = j^{-1} \bmod M$ [Har65, Lip81].
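
The mapping (4.18)-(4.19) and the component-wise product (4.21) can be traced on a small example. The sketch below assumes M = 13 and j = 5 (since $5^2 = 25 = -1 \bmod 13$); these values and the function names are illustrative choices, not taken from the original text.

```python
def qrns_forward(x, y, j, M):
    """Equation (4.18): map x + iy to the pair (Z0, Z1)."""
    return ((x + j * y) % M, (x - j * y) % M)

def qrns_inverse(Z0, Z1, j, M):
    """Equation (4.19): recover (x, y) from (Z0, Z1)."""
    inv2 = pow(2, -1, M)
    invj = pow(j, -1, M)
    x = inv2 * (Z0 + Z1) % M
    y = inv2 * invj * (Z0 - Z1) % M
    return (x, y)

def complex_mul_mod(u, v, M):
    """Conventional complex product with both parts reduced mod M."""
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % M, (a * d + b * c) % M)

M, j = 13, 5
u, v = (3, 4), (2, 11)
U, V = qrns_forward(*u, j, M), qrns_forward(*v, j, M)
W = (U[0] * V[0] % M, U[1] * V[1] % M)      # two real multiplications
w = qrns_inverse(*W, j, M)
print(w, complex_mul_mod(u, v, M))
```

Only two modular multiplications are needed in the QRNS domain, compared with four in the conventional product; the cost is shifted to the forward and inverse mappings.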

4.3.4 Extended Algebraic Integer and Polynomial Residue Number System (PRNS)

The principal advantage of RNS processing is its ability to reduce a complex
multiplication or addition to the calculation of a single integer multiplication or integer addition
modulo a prime in parallel sets of residue channels. When the primes are small, the
implementation complexity of the computation in each of the residue channels is correspondingly small,
implying substantially high throughput.

To work in an RNS, the complex roots of unity needed in the computation, and also the
input data, must first be quantized. This quantization is conventionally performed by scaling
and then rounding to the nearest Gaussian integer. This method introduces errors that are
dependent on the scale factor. The large scale factors that constrain the quantization error also
impose a large dynamic range requirement on the RNS, thus offsetting its advantage.




Instead, Cozzens and Finkelstein [Coz85] introduced the idea of performing the
computation in a certain ring of algebraic integers that contains the Gaussian integers. This is
accomplished by approximating the appropriate complex roots of unity by elements of the ring.
Games [Gam85] illustrated a method to perform this approximation by elements of the
algebraic integers of $Q(\omega)$. Due to the nature of these approximations, the dynamic range of the
computation is dramatically reduced.

Algebraic Integer Concepts. To reduce the range requirements, Gaussian integers u +
jv, with u and v small, can be used to approximate only the arguments of the n-th roots of unity.
The approach of Gaussian integers, which are the algebraic integers of the field $Q(i)$, was
replaced by approximations using algebraic integers in higher degree extensions of Q.
Typically, the role of $i = \exp(j2\pi/4)$, a primitive fourth root of unity, is replaced by
$\omega = \exp(j2\pi/8)$ or $\omega = \exp(j2\pi/16)$. While algebraically similar, the algebraic integers
$Z[\omega] \subseteq Q(\omega)$ can differ from $Z[i]$ dramatically in the sense that $Z[\omega]$ can be dense in C, so that
arbitrarily good approximations can be obtained.

Let $\omega$ be a primitive n-th root of unity in C with n > 2, for instance $\omega = \exp(j2\pi/n)$. It is
known [Niv80, Chapter 2] that $\omega$ satisfies an irreducible monic polynomial $C_n(x)$ of degree
$\phi(n)$, and this cyclotomic polynomial is

$$C_n(x) = \prod_{i} (x - \omega^i), \tag{4.22}$$

where the product is taken over positive integers $i < n$ that are relatively prime to n. Similar to
the concept of constructing finite extension fields discussed in Chapter 3, the subfield of
C obtained by adjoining $\omega$ to the rational numbers Q, denoted by $Q(\omega)$, has, as a vector
space over Q, dimension $\phi(n)$. In fact,

$$Q(\omega) = \{ a_0 + a_1\omega + \ldots + a_{\phi(n)-1}\omega^{\phi(n)-1} : a_i \in Q \}. \tag{4.23}$$




Since we deal with finite structures in this dissertation, we are concerned with the subset
$Z[\omega] \subset Q(\omega)$ defined by

$$Z[\omega] = \{ z_0 + z_1\omega + \ldots + z_{\phi(n)-1}\omega^{\phi(n)-1} : z_i \in Z \}. \tag{4.24}$$

$Z[\omega]$ forms a ring under complex addition and multiplication, where the rule implied by
$C_n(\omega) = 0$ is used to reduce powers of $\omega$ larger than $\phi(n) - 1$. Also, any element $z \in Z[\omega]$ is
a root of a monic polynomial (minimal polynomial) $x^m + c_{m-1}x^{m-1} + \ldots + c_0$ with
coefficients $c_i \in Z$ and degree $m \geq 1$. Furthermore, $Z[\omega]$ is exactly the subset of $Q(\omega)$ that has
elements with this property. As such, $Z[\omega]$ is called the ring of algebraic integers of $Q(\omega)$. It
is well known that when n = 4, $Z[\omega]$ is just the Gaussian integers.

If $z = x + jy = z_0 + z_1\omega + \ldots + z_{\phi(n)-1}\omega^{\phi(n)-1} \in Z[\omega]$, then z is represented either
by its complex coordinates (x, y), thought of as a point in $R^2$, or by its $Z[\omega]$ coordinates
$[z_0, z_1, \ldots, z_{\phi(n)-1}]$. Two elements u and v of $Z[\omega]$ are added component-wise:
$u + v = [u_0 + v_0, u_1 + v_1, \ldots, u_{\phi(n)-1} + v_{\phi(n)-1}]$. If $m \in Z$, then
$mu = [mu_0, mu_1, \ldots, mu_{\phi(n)-1}]$. Multiplication is defined by

$$uv = c = \sum_{i=0}^{\phi(n)-1} c_i\, \omega^i,$$

where, with $a_i$ and $b_i$ denoting the coordinates of u and v,

$$c_k = \sum_{j=0}^{k} a_{k-j}\, b_j - \sum_{j=k+1}^{\phi(n)-1} a_j\, b_{\phi(n)-j+k}. \tag{4.25}$$

Observe that multiplication in $Z[\omega]$ is similar to the circular convolution of the two
sequences u and v.

Algebraic Integer Extension Approximations. In this presentation we demonstrate a
system based on a complex extension of degree 4. Let $\omega = \exp(j2\pi/8)$ be the primitive
eighth root of unity. Then $\omega$ satisfies the equation generated by the cyclotomic polynomial
$C_8(x)$, $\omega^4 + 1 = 0$. The cyclotomic field $Q(\omega)$ has the form
$Q(\omega) = \{ a_0 + a_1\omega + a_2\omega^2 + a_3\omega^3 : a_i \in Q \}$. Its respective ring of algebraic integers is defined by
$Z[\omega] = \{ z_0 + z_1\omega + z_2\omega^2 + z_3\omega^3 : z_i \in Z \}$. An element z of $Z[\omega]$ is represented by
$[z_0, z_1, z_2, z_3]$. The representation of a complex number t by $Z[\omega]$ coordinates is
demonstrated in Figure 4.1.

$$t = x + jy = 4\omega^0 + 3\omega^1 + 2\omega^2 + 1\omega^3 = 5.4142 + j4.8284 = [4, 3, 2, 1]$$

Figure 4.1 The Representation of Complex Number t in $Z[\omega]$


For an arbitrary complex number t = x + jy, where $x, y \in R$, an algebraic integer s can be
chosen to approximate t. Since $\omega = (\sqrt{2}/2) + j(\sqrt{2}/2)$, the coordinates $[a_0, a_1, a_2, a_3]$ of s satisfy

$$x = a_0 + \frac{\sqrt{2}}{2}(a_1 - a_3) = x_0 + x_1\sqrt{2},$$

$$y = a_2 + \frac{\sqrt{2}}{2}(a_1 + a_3) = y_0 + y_1\sqrt{2},$$

where $x_0 = a_0$, $y_0 = a_2$, $x_1 = (a_1 - a_3)/2$, and $y_1 = (a_1 + a_3)/2$.
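
As a quick numerical check of these coordinate formulas, the sketch below evaluates them for the coordinates [4, 3, 2, 1] and compares the result with a direct evaluation of the polynomial in $\omega$. The function name `coords_to_xy` is an assumption introduced for illustration.

```python
import cmath
import math

def coords_to_xy(a0, a1, a2, a3):
    """Real and imaginary parts of a0 + a1*w + a2*w^2 + a3*w^3, w = exp(j*2*pi/8)."""
    r = math.sqrt(2) / 2
    return (a0 + r * (a1 - a3), a2 + r * (a1 + a3))

x, y = coords_to_xy(4, 3, 2, 1)

w = cmath.exp(2j * cmath.pi / 8)
t = 4 + 3 * w + 2 * w**2 + w**3      # direct evaluation for comparison

print(round(x, 4), round(y, 4))
print(abs(t - complex(x, y)))        # close to zero
```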




It is obvious that the problem of approximating complex numbers by an algebraic
integer of $Z[\omega]$ is equivalent to the approximation of real numbers in R by $\sqrt{2}$. In fact, we propose the
one-dimensional approximation to reduce the complexity of the two-dimensional
approximations suggested by Games [Gam85]. In the case of finding integers $x_0$ and $x_1$ to
approximate $\sqrt{2}$, we define the approximation error $e = |\sqrt{2} - (x_0/x_1)|$, where $x_0$ and $x_1$ are
integers. According to the mean value theorem, it follows that

$$\frac{f(x) - f(x_0/x_1)}{x - (x_0/x_1)} = f'(\eta) \tag{4.26}$$

for some $\eta \in [\sqrt{2}, (x_0/x_1)]$ and $f(x) = x^2 - 2$. Letting $x = \sqrt{2}$, Equation (4.26) becomes

$$\sqrt{2} - (x_0/x_1) = -f(x_0/x_1)/f'(\eta).$$

Therefore,

$$e = |\sqrt{2} - (x_0/x_1)| = |((x_0^2/x_1^2) - 2)/(2\eta)|.$$

Thus, the approximation error e becomes

$$e = |(x_0^2 - 2x_1^2)/x_1^2| \cdot |1/(2\eta)| > 1/(3x_1^2)$$

and also

$$e < |(x_0^2 - 2x_1^2)/x_1^2| \cdot (1/2) < (2/x_1^2)(1/2) = 1/x_1^2.$$

Thus, there exists a range for the approximation error e:

$$\frac{1}{3x_1^2} < e < \frac{1}{x_1^2}. \tag{4.27}$$
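
The bounds of Equation (4.27) can be observed numerically. The sketch below assumes that the integer pairs are taken from the continued-fraction convergents of $\sqrt{2}$, for which $|x_0^2 - 2x_1^2| = 1$; this choice of pairs and the generator name are assumptions made for illustration, not a selection rule stated in the text.

```python
import math

def sqrt2_convergents(count):
    """Yield pairs (x0, x1) with |x0**2 - 2*x1**2| = 1, starting from 1/1."""
    x0, x1 = 1, 1
    for _ in range(count):
        yield x0, x1
        x0, x1 = x0 + 2 * x1, x0 + x1

results = []
for x0, x1 in sqrt2_convergents(6):
    e = abs(math.sqrt(2) - x0 / x1)
    lower, upper = 1 / (3 * x1 * x1), 1 / (x1 * x1)
    results.append((x0, x1, lower, e, upper))
    print(x0, x1, lower < e < upper)
```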


Polynomial Residue Number Systems (PRNS). The ring of Gaussian integers modulo
M, denoted as $Z_M[i]$, forms the basis of the QRNS with M of the form 4k + 1. It is easy to see that
the product of two complex numbers is equivalent to the product of two first-order
polynomials taken modulo $(x^2 + 1)$. Extending the QRNS idea, we introduce the algebraic integers
of higher degree to achieve a better approximation to the complex numbers.




The product of two (N-1)-order polynomials modulo $x^N + 1$ over some modular ring
$Z_M$ defines a residue number system of order N. To obtain the lowest possible multiplication
count within the prime field, the N-th order congruence, $x^N = -1 \bmod M$, must be solvable.
That is, the polynomial $x^N + 1$ must factor into N distinct factors in $Z_M$ as

$$x^N + 1 = (x - r_0)(x - r_1) \cdots (x - r_{N-1}) \bmod M, \tag{4.28}$$

with $r_0, r_1, \ldots, r_{N-1} \in Z_M$. The necessary and sufficient condition for the N-th order roots of
-1 to exist in the integer ring $Z_M$ is $N \mid (M - 1)/2$ [Ska87, Lip81]. As a result, there exists an
isomorphic mapping $F_N$ of z(x) of $Z_M[i]$ onto the external direct product
$Z_M \times Z_M \times \ldots \times Z_M = Z_M^N$, where

$$z(x) = z_0 + z_1 x^1 + \ldots + z_{N-1} x^{N-1}, \tag{4.29}$$

with $x = \exp(j2\pi/2N)$ a primitive 2N-th root of unity in C (so that $x^N = -1$), and $z_i \in Z_M$, i = 0, 1, ..., N-1.
This is called the Polynomial Residue Number System (PRNS). The PRNS isomorphic
mapping satisfies the following relationship:

$$z(x) = z_0 + z_1 x^1 + \ldots + z_{N-1} x^{N-1} \ \overset{F_N}{\longleftrightarrow}\ (Z_0, Z_1, \ldots, Z_{N-1}). \tag{4.30}$$

The input and output parameters $z_i$, $Z_i$ satisfy the equations

$$Z_i = z(x) \bmod (x - r_i) = z(r_i) \bmod M \tag{4.31}$$

and

$$z(x) = \Big( \sum_{i=0}^{N-1} Z_i\, Q_i(x) \Big) \bmod (x^N + 1), \tag{4.32}$$

where $z_i, Z_i, r_i \in Z_M$, i = 0, 1, ..., N-1, and

$$Q_i(x) = N^{-1}\big( 1 + r_i^{-1}x + r_i^{-2}x^2 + \ldots + r_i^{-(N-1)}x^{N-1} \big).$$

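The mapping also can be represented in vector-matrix form as

$$Z^T = F_N\, z^T \tag{4.33}$$

and

$$z^T = F_N^{-1} Z^T, \tag{4.34}$$

where $z = [z_0, z_1, \ldots, z_{N-1}]$ and $Z = [Z_0, Z_1, \ldots, Z_{N-1}]$ are vectors. The matrix forms $F_N$
and $F_N^{-1}$ of the isomorphic mapping are

$$F_N = \begin{bmatrix} 1 & r_0 & r_0^2 & \cdots & r_0^{N-1} \\ 1 & r_1 & r_1^2 & \cdots & r_1^{N-1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & r_{N-1} & r_{N-1}^2 & \cdots & r_{N-1}^{N-1} \end{bmatrix} \tag{4.35}$$

and

$$F_N^{-1} = N^{-1} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ r_0^{-1} & r_1^{-1} & \cdots & r_{N-1}^{-1} \\ \vdots & \vdots & & \vdots \\ r_0^{-(N-1)} & r_1^{-(N-1)} & \cdots & r_{N-1}^{-(N-1)} \end{bmatrix}. \tag{4.36}$$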

The rules of composition in $Z_M \times Z_M \times \ldots \times Z_M = Z_M^N$ are

1) addition:

$$(Z_{00}, Z_{01}, \ldots, Z_{0,N-1}) + (Z_{10}, Z_{11}, \ldots, Z_{1,N-1}) = [(Z_{00} + Z_{10}) \bmod M,\ (Z_{01} + Z_{11}) \bmod M,\ \ldots,\ (Z_{0,N-1} + Z_{1,N-1}) \bmod M]; \tag{4.37}$$

2) multiplication:

$$(Z_{00}, Z_{01}, \ldots, Z_{0,N-1})(Z_{10}, Z_{11}, \ldots, Z_{1,N-1}) = [(Z_{00} Z_{10}) \bmod M,\ (Z_{01} Z_{11}) \bmod M,\ \ldots,\ (Z_{0,N-1} Z_{1,N-1}) \bmod M], \tag{4.38}$$

where $Z_{ij} \in Z_M$. From these rules, multiplication in the new domain is simply a
component-wise operation; as such, it requires only N operations for an N-tuple data stream, as
opposed to the $N^2$ operations necessitated by the conventional approach. This computational
complexity reduction is encouraging. However, the main concern involves the isomorphic mappings,
which are discussed in a later chapter.
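
The PRNS mapping and its component-wise product can be traced on a small case. The sketch below assumes N = 4 and M = 17, for which $N \mid (M - 1)/2$; the function names and operands are illustrative assumptions, and the roots are found by exhaustive search rather than by any method given in the text.

```python
def prns_roots(N, M):
    """All r in Z_M with r**N = -1 mod M (found by exhaustive search)."""
    return [r for r in range(1, M) if pow(r, N, M) == M - 1]

def prns_forward(z, roots, M):
    """Equation (4.31): Z_i = z(r_i) mod M."""
    return [sum(zi * pow(r, i, M) for i, zi in enumerate(z)) % M for r in roots]

def prns_inverse(Z, roots, M):
    """Equations (4.32)/(4.36): z_k = N^-1 * sum_i Z_i * r_i**(-k) mod M."""
    N = len(Z)
    n_inv = pow(N, -1, M)
    return [n_inv * sum(Zi * pow(r, -k, M) for Zi, r in zip(Z, roots)) % M
            for k in range(N)]

def negacyclic_mul(a, b, M):
    """Conventional product of two polynomials modulo x**N + 1 and M (N*N multiplies)."""
    N = len(a)
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - N] -= a[i] * b[j]     # x**N is replaced by -1
    return [ck % M for ck in c]

N, M = 4, 17
roots = prns_roots(N, M)
a, b = [4, 3, 2, 1], [1, 0, 16, 2]
A, B = prns_forward(a, roots, M), prns_forward(b, roots, M)
C = [(x * y) % M for x, y in zip(A, B)]         # N multiplies in the PRNS domain
print(prns_inverse(C, roots, M), negacyclic_mul(a, b, M))
```

The two printed results agree, which illustrates that the N component-wise products replace the $N^2$ products of the conventional polynomial multiplication; the forward and inverse mappings are not free, as noted above.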

Since the complex arithmetic of residue number systems in DSP applications is a recent
subject of intense study [Ree75b, Jen80, Kro83], the application of the PRNS to complex
multiplication seems to be a viable solution that offers tremendously low complexity in complex
operations. A possible complex number arithmetic model is shown in Figure 4.2. In this
model, analog signals are first digitized by an analog-to-digital (A/D) converter and
approximated by algebraic integers. PRNS mappings are then applied to the polynomial form of the
input signals. A low-complexity component-wise operation is executed in the PRNS domain.
The result is mapped back to the algebraic integer polynomial format, followed by a
digital-to-analog (D/A) converter.

4.4  Transforms  and  Computations  in  Extension  Fields 

All the finite field transforms and the RNS systems discussed so far operate over a ground
field $GF(p)$ where p may be a Fermat or Mersenne number, or p satisfies some specified
condition such as having the form 4k + 1 or 2Nk + 1. When these requirements are not met,
transforms may still exist in the higher-order extension field $GF(p^m)$. A number of previously
discussed finite field properties, such as index calculus, conjugacy of field elements, and basis
representation, are used to perform and expedite these transforms in extension fields.

4.4.1 Index Calculus Complex Residue Number System (ICCRNS)

ICQRNS. For certain primes, the quadratic equation $x^2 = -1 \bmod M$ does not have a
solution in $Z_M$. In this case -1 is called a quadratic nonresidue mod M. It follows that a
solution to the equation can be found in the second-order extension field $GF(M^2)$ (or $Z_{M^2}$)
and is denoted $j = \sqrt{-1}$. Under these circumstances the set of elements in $GF(M^2)$ is

$$Z_{M^2} = Z_M(j) = \{ a + jb : a, b \in Z_M \}, \tag{4.39}$$


Figure 4.2 The Pictorial Representation of the N-th Order Single Modulus PRNS System


and addition and multiplication in the extension field are defined by

$$[(a + jb) + (c + jd)] \bmod M = [(a + c) + j(b + d)] \bmod M, \tag{4.40}$$

$$[(a + jb) \cdot (c + jd)] \bmod M = [(ac - bd) + j(ad + bc)] \bmod M. \tag{4.41}$$

It is known [Har65] that for all primes of the form M = 4k + 3, -1 is a quadratic
nonresidue mod M, where k is an integer. This defines a set of primes that can be selected for forming
extension fields where the quadratic congruence is solvable. For example, let k = 1; then the
prime M = 7 causes the extension field $GF(7^2)$ of 49 elements to have a solution to the
quadratic equation; likewise, for k = 4, there exists the extension field $GF(19^2)$ of 361 elements.

1) Prime finite field index calculus CRNS

Every prime field $GF(M)$ has an associated index set that is analogous to a set of
logarithms in the real number system [see Chapter 2]. By using this index set and adding the
indexes mod M-1, finite field multiplication can be implemented with a mod M-1 adder and
a set of index tables stored in high-speed memory which is of the order of (log M). For
example, the four real multiplications required for the complex multiply of Equation (4.41) can be
replaced by four real additions and a nominal amount of memory storage.

2) Complex extension field index calculus CRNS

The complex extension field $GF(M^2)$ also has a real index set that is generated by a
complex primitive element of order $M^2 - 1$. The existence of a real index calculus implies that
complex multiplication in $GF(M^2)$ can be replaced by a single real addition and a table of
indexes. Since $GF(M^2)$ contains $M^2 - 1$ nonzero elements, the index addition is mod $M^2 - 1$
addition, and the table of indexes will be of the order of (log $M^2$).
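
The extension field index calculus can be sketched for M = 7. In the code below, a primitive element of $GF(7^2)$ is found by exhaustive search, index (logarithm) and antilogarithm tables are built, and a complex product is then obtained from a single addition mod $M^2 - 1$. The function names and the search procedure are assumptions made for illustration only.

```python
def cmul(u, v, M):
    """Complex product of Equation (4.41) on pairs (a, b) = a + jb."""
    (a, b), (c, d) = u, v
    return ((a * c - b * d) % M, (a * d + b * c) % M)

def build_index_tables(M):
    """Return (antilog, log) tables generated by a primitive element of GF(M^2)."""
    order = M * M - 1
    for g in ((a, b) for a in range(M) for b in range(1, M)):
        antilog, w = [], (1, 0)
        for _ in range(order):
            antilog.append(w)
            w = cmul(w, g, M)
        if len(set(antilog)) == order:          # g generates every nonzero element
            return antilog, {e: i for i, e in enumerate(antilog)}
    raise ValueError("no primitive element found")

def index_mul(u, v, antilog, log, M):
    """Complex multiplication by one index addition mod M^2 - 1."""
    if u == (0, 0) or v == (0, 0):
        return (0, 0)                           # zero has no index
    return antilog[(log[u] + log[v]) % (M * M - 1)]

M = 7
antilog, log = build_index_tables(M)
print(index_mul((3, 4), (2, 6), antilog, log, M), cmul((3, 4), (2, 6), M))
```

The zero element has no index and must be handled separately, which is a practical detail any table-based realization has to address.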

ICPRNS. Similar to the QRNS, for certain primes, the equation $x^N = -1 \bmod M$ does not have a solution in GF($M$). According to previous chapters, a solution to the congruence (or equation) can be found in the $N$th order extension field GF($M^N$) (or $Z_{M^N}$) and is denoted $r$. Under these circumstances the set of elements in GF($M^N$) is

$$Z_{M^N} \cong Z_M(r) = \{ z_0 + z_1 r + \ldots + z_{N-1} r^{N-1} : z_i \in Z_M \}. \tag{4.42}$$

Under the rules of composition, two elements $u$ and $v$ of $Z_{M^N}$ when added yield $u + v = [u_0 + v_0, u_1 + v_1, \ldots, u_{N-1} + v_{N-1}]$, with multiplication defined by

$$u \cdot v = c = \sum_{k=0}^{N-1} c_k r^k \tag{4.43}$$

where

$$c_k = \sum_{j=0}^{k} u_{k-j} v_j - \sum_{j=k+1}^{N-1} u_{N-j+k} v_j .$$
For  all  primes  not  having  the  form  A/  = + 1 , this  defines  a set  of  primes  that  can  form 

extension fields where the congruence is solvable. For example, let $N = 4$; then the prime $M = 11$ results in the extension field GF($11^4$) of 14,641 elements.

The algebraic integer extension field GF($M^N$) also has a real index set that is generated by a complex primitive element of order $M^N - 1$. The existence of a real index calculus implies that complex multiplication in GF($M^N$) can be replaced by a single real addition and a table of indexes. Since GF($M^N$) contains $M^N - 1$ nonzero elements, the index addition is mod $M^N - 1$ addition, and the table of indexes will be of the order of ($\log M^N$).

4.4.2  Finite  Extension  Field  Transforms 

Transforms Over Extension Field GF($p^m$). Transforms analogous to the discrete Fourier transform can be defined in finite fields and also be calculated efficiently by the FFT algorithms. For the finite field GF($p^m$), let $d$ be a divisor of $p^m - 1$ (possibly $d = p^m - 1$), and $\alpha$ be an element of order $d$ in the multiplicative group $G^*$ of GF($p^m$), which generates $\{ 1, \alpha, \alpha^2, \ldots, \alpha^{d-1} \}$. Then one can define the transform $T[d]$ of a sequence $a = \{ a_i : i = 0, 1, 2, \ldots, d-1 \}$ of elements of GF($p^m$) to be the sequence $A = \{ A_k : k = 0, 1, 2, \ldots, d-1 \}$ where


$$A_k = \sum_{i=0}^{d-1} a_i \alpha^{ik}. \tag{4.44}$$

Its inverse transform is

$$a_i = D \cdot \sum_{k=0}^{d-1} A_k \alpha^{-ik} \tag{4.45}$$

where $D$ is the integer for which

$$D \cdot d \equiv 1 \bmod p. \tag{4.46}$$


The transform pair in Equations (4.44) and (4.45) may be calculated by FFT algorithms, which are simplest to perform when the integer $d$ is highly composite. The total number of operations is reduced to $O(d \log d)$ from the $O(d^2)$ operations required to calculate these transforms in the most obvious way. The principal reason for interest in the transform (4.44) lies in the following "Cyclic Convolution Property" (CCP). Suppose that three pairs of sequences $a$, $A$, and $b$, $B$, and $c$, $C$, all of length $d$, form transform pairs as those of Equations (4.44) and (4.45), and that

$$C_i = A_i \cdot B_i , \qquad 0 \le i \le d-1 , \tag{4.47}$$

then

$$c_i = \sum_{j=0}^{d-1} \sum_{k=0}^{d-1} a_j b_k , \tag{4.48}$$

where the sum is restricted to $j + k \equiv i \bmod d$ and $0 \le i \le d-1$. If we extend the definition of $b$ to all $i$ by making the sequence periodic with period $d$, Equation (4.48) may be written

$$c_i = \sum_{j=0}^{d-1} a_j b_{i-j} . \tag{4.49}$$


Thus, the "cyclic convolution" of the sequences $a$ and $b$, as defined by Equation (4.48), may be obtained by transforming the sequences, multiplying the results term-by-term as in Equation (4.47), and performing the inverse transform (4.45).




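To show the CCP of Equations (4.47) and (4.48), we derive the discrete delta function $\delta_d(l)$. Observe first that all elements $x$ of $G^*$ satisfy the equation

$$x^d - 1 = 0 . \tag{4.50}$$

However, since the equation factors as

$$x^d - 1 = (x - 1) \sum_{k=0}^{d-1} x^k , \tag{4.51}$$

for $x \ne 1$, one has

$$\sum_{k=0}^{d-1} x^k = 0 . \tag{4.52}$$

Now consider the sum of the powers $\alpha^{lk}$, where $\alpha \in G^*$; this is

$$\sum_{k=0}^{d-1} \alpha^{lk} = \sum_{k=0}^{d-1} (\alpha^l)^k . \tag{4.53}$$

For the case of $l \ne 0 \bmod d$, with $\beta \in G^*$ and $\beta = \alpha^l$, from Equation (4.52) it becomes

$$\sum_{k=0}^{d-1} \alpha^{lk} = \sum_{k=0}^{d-1} \beta^k = 0 ; \tag{4.54}$$

however, for $l = 0 \bmod d$, $\alpha^l = 1$ and

$$\sum_{k=0}^{d-1} \alpha^{lk} = d . \tag{4.55}$$

From the last two equations, we represent

$$\sum_{k=0}^{d-1} \alpha^{lk} = d \cdot \delta_d(l) .$$

By Equation (4.45) the inverse transform of $C_k$ is

$$\begin{aligned}
c_i &= D \cdot \sum_{k=0}^{d-1} C_k \alpha^{-ik} \\
&= D \cdot \sum_{k=0}^{d-1} ( A_k \cdot B_k ) \alpha^{-ik} \\
&= D \cdot \sum_{k=0}^{d-1} \Big( \sum_{j=0}^{d-1} a_j \alpha^{jk} \Big) \Big( \sum_{m=0}^{d-1} b_m \alpha^{mk} \Big) \alpha^{-ik} \\
&= D \cdot \sum_{j=0}^{d-1} \sum_{m=0}^{d-1} a_j b_m \Big( \sum_{k=0}^{d-1} \alpha^{(j+m-i)k} \Big) \\
&= D \cdot \sum_{j=0}^{d-1} \sum_{m=0}^{d-1} a_j b_m \big( d \cdot \delta_d(j + m - i) \big) \\
&= D \cdot d \sum_{j=0}^{d-1} \sum_{m=0}^{d-1} a_j b_m \delta_d(j + m - i) .
\end{aligned}$$

With $j + m = i \bmod d$ we conclude that

$$c_i = \sum_{j=0}^{d-1} a_j b_{(i-j) \bmod d} .$$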
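The cyclic convolution property can be checked numerically. The sketch below makes a simplifying assumption not made in the text: it works in a prime field GF($p$) (that is, $m = 1$), so that field arithmetic is ordinary integer arithmetic mod $p$. The function names are illustrative, and `pow(x, -1, p)` requires Python 3.8 or later.

```python
def element_of_order(d, p):
    """Return an element of multiplicative order exactly d in GF(p), d > 1."""
    if (p - 1) % d != 0:
        raise ValueError("d must divide p - 1")
    for x in range(2, p):
        if pow(x, d, p) == 1 and all(pow(x, k, p) != 1 for k in range(1, d)):
            return x
    raise ValueError("no element of that order")


def transform(a, alpha, p):
    """A_k = sum_i a_i * alpha^(i*k), as in Equation (4.44)."""
    d = len(a)
    return [sum(a[i] * pow(alpha, i * k, p) for i in range(d)) % p
            for k in range(d)]


def inverse_transform(A, alpha, p):
    """a_i = D * sum_k A_k * alpha^(-i*k), with D*d = 1 mod p, as in (4.45)."""
    d = len(A)
    D = pow(d, -1, p)
    alpha_inv = pow(alpha, -1, p)
    return [D * sum(A[k] * pow(alpha_inv, i * k, p) for k in range(d)) % p
            for i in range(d)]


def cyclic_convolution(a, b, p):
    """Direct evaluation of c_i = sum_j a_j * b_((i-j) mod d)."""
    d = len(a)
    return [sum(a[j] * b[(i - j) % d] for j in range(d)) % p for i in range(d)]


if __name__ == "__main__":
    p, d = 17, 8
    alpha = element_of_order(d, p)
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [3, 0, 1, 0, 0, 2, 0, 5]
    A, B = transform(a, alpha, p), transform(b, alpha, p)
    C = [(x * y) % p for x, y in zip(A, B)]
    print(inverse_transform(C, alpha, p) == cyclic_convolution(a, b, p))
```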

Conjugacy Property of Finite Field. Elements of the field with the same minimal polynomial are called conjugates with respect to GF($p$). For $\beta \in$ GF($p^m$), the $p$th powers of $\beta$ fall into disjoint conjugate sets. A typical conjugate set $B$ is shown as

$$B = \{ \beta, \beta^{p}, \beta^{p^2}, \ldots, \beta^{p^{l-1}} \} \tag{4.56}$$

where $l$ is called the coset length and represents the smallest positive integer such that $\beta^{p^l} = \beta$. The conjugate sets can also be defined in terms of the exponents of $\beta$, where the operation of multiplying the exponents by $p$ divides the integers modulo $(p^m - 1)$ into sets of conjugate classes called the cyclotomic cosets. A cyclotomic coset $C$ containing $s$ consists of the set

$$C = \{ s, sp, sp^2, \ldots, sp^{l_s - 1} \} , \tag{4.57}$$

where $l_s$ is the smallest positive integer that satisfies

$$p^{l_s} s \equiv s \bmod p^m - 1 . \tag{4.58}$$




The coset is then denoted by $C_s$, where $s$, called the coset leader, is the smallest number in the coset. Some properties of the coset length are: (1) if $0 \in C_s$, then $l_s = 1$; (2) given any divisor $t$ of $m$ (excluding $t = 1$ when $p = 2$), with $\theta = (p^m - 1)/(p^t - 1) \in C_s$, $l_s = t$; and (3) with $\theta$ as in (2) and $\alpha$ a primitive element of GF($p^m$), $\alpha^\theta$ generates the subfield GF($p^{l_s}$) of GF($p^m$) and consequently $l_s \mid m$. Note that $s = 0$ is considered as a coset of length 1.

Example 4.4 The cyclotomic cosets of small finite fields.

Consider the finite field GF($2^4$). Let $d = 2^4 - 1$; then the divisors $d_i$ of $d$ are { 1, 3, 5, 15 }. Because 3 divides $2^2 - 1$ and $\phi(3) = 2$, there is one coset of length 2 with elements of order 3. Since 15 divides $2^4 - 1$, there are 2 (= $\phi(15)/4$) cosets of length 4 with elements of order 15. Lastly, since 5 divides $2^4 - 1$, there exists a coset of length 4 with elements of order 5. A summary is listed below.

$C_0$ = { 0 } - element of order 1

$C_1$ = { 1, 2, 4, 8 } - elements of order 15

$C_3$ = { 3, 6, 12, 9 } - elements of order 5

$C_5$ = { 5, 10 } - elements of order 3

$C_7$ = { 7, 14, 13, 11 } - elements of order 15.

For the finite field GF($2^5$), follow the procedures in the previous case. One finds 6 (= $\phi(31)/5$) cosets of length 5 with elements of order 31 and one coset of length 1. These cosets are

$C_0$ = { 0 }

$C_1$ = { 1, 2, 4, 8, 16 }

$C_3$ = { 3, 6, 12, 24, 17 }

$C_5$ = { 5, 10, 20, 9, 18 }

$C_7$ = { 7, 14, 28, 25, 19 }

$C_{11}$ = { 11, 22, 13, 26, 21 }

$C_{15}$ = { 15, 30, 29, 27, 23 }. ♦
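The cosets of Example 4.4 can be enumerated mechanically by repeatedly multiplying an exponent by $p$ modulo $p^m - 1$. The following is a small sketch; the function name is an assumption made for illustration.

```python
def cyclotomic_cosets(p, n):
    """Partition {0, ..., n-1} into orbits of s -> s*p mod n.

    Assumes gcd(p, n) == 1 (for example n = p**m - 1), so each orbit
    returns to its starting value. Each coset is listed from its leader.
    """
    seen = set()
    cosets = []
    for s in range(n):
        if s in seen:
            continue
        coset = [s]
        t = (s * p) % n
        while t != s:
            coset.append(t)
            t = (t * p) % n
        seen.update(coset)
        cosets.append(coset)
    return cosets


if __name__ == "__main__":
    for coset in cyclotomic_cosets(2, 2**4 - 1):
        print(coset)
```

Run for $p = 2$ and $n = 15$, this lists the five cosets of GF($2^4$) in the order given in the example.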

Given an integer $d$ with $d \mid (p^m - 1)$, the number of cyclotomic cosets, denoted by $c$, and the length of a coset, $l$, can be calculated as follows. Let $d_i$ be a divisor of $d$ for $i = 1, 2, \ldots, t$. Note that Euler's theorem indicates $d_i$ is also the possible order of the elements in the finite field GF($p^m$). There exist $c_i$ cyclotomic cosets of length $l_i$, where $c_i = \phi(d_i)/l_i$ and $l_i$ is the least positive integer such that $p^{l_i} \equiv 1 \bmod d_i$. Hence, the number of cyclotomic cosets, $c$, for length $d$ becomes

$$c = \sum_{d_i \mid d} c_i . \tag{4.59}$$

Example 4.5 The number of cyclotomic cosets.

1) In the case of the finite field GF($2^8$), since $d \mid (2^8 - 1)$, let $d = 85$. Hence, $d_i$ = { 85, 17, 5, 1 }. For $d_0 = 85$, $c_0 = \phi(d_0)/l_0 = \phi(85)/l_0 = 64/l_0$. Since $l_0 = 8$ is the least positive integer such that $2^{l_0} \equiv 1 \bmod d_0$, $c_0 = 8$, which means that there are 8 cosets of length 8. Similarly, for $d_1 = 17$, $c_1 = \phi(17)/l_1 = 16/l_1$ and $17 \mid (2^8 - 1)$, which indicates $l_1 = 8$; therefore, $c_1 = 2$. For $d_2 = 5$, $c_2 = \phi(5)/4 = 1$. For $d_3 = 1$, $c_3 = 1$. Thus, the total number of cosets in the case of $d = 85$ is 12.

2) Let $d = 17$; then $d_i$ = { 17, 1 }. For $d_0 = 17$, $c_0 = \phi(d_0)/l_0 = \phi(17)/l_0 = 16/l_0$. Since $l_0 = 8$ is the least positive integer such that $2^{l_0} \equiv 1 \bmod d_0$, $c_0 = 2$, which means that there are 2 cosets of length 8. For $d_1 = 1$, $c_1 = 1$. Thus, the total number of cosets in the case of $d = 17$ is 3.

♦
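The counts in Example 4.5 follow directly from Equation (4.59). A brief sketch, with helper names chosen here for illustration and a deliberately naive totient:

```python
from math import gcd


def euler_phi(n):
    """Euler's totient by direct counting (adequate for small n)."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)


def multiplicative_order(p, n):
    """Least l >= 1 with p**l = 1 mod n; requires gcd(p, n) == 1."""
    if n == 1:
        return 1
    l, x = 1, p % n
    while x != 1:
        x = (x * p) % n
        l += 1
    return l


def count_cosets(p, d):
    """c = sum over divisors d_i of d of phi(d_i) / l_i, Equation (4.59)."""
    return sum(euler_phi(di) // multiplicative_order(p, di)
               for di in range(1, d + 1) if d % di == 0)


if __name__ == "__main__":
    print(count_cosets(2, 85), count_cosets(2, 17))
```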




In the previous chapter we noted that the $p^i$th powers of a field element in the finite field GF($p^m$) of characteristic $p$ fall into the same cyclotomic coset. This conjugacy property of the cyclotomic cosets in the finite field GF($p^m$) can be applied to expedite finite field transforms.

Theorem 4.1: For $V$ a vector of length $d$ of elements of GF($p^m$), where $d$ is a divisor of $p^m - 1$, the inverse Fourier transform $v$ is a vector of elements of GF($p$) if and only if the following equations are satisfied

$$V_j^p = V_{(pj) \bmod d} , \tag{4.60}$$

where $V_j \in V$ and $j = 0, 1, \ldots, d-1$.

Proof. By definition of the finite field transformation in Equation (4.44) and the power forming property in the Corollary of Equation (2.31), it becomes

$$V_j^p = \Big( \sum_{i=0}^{d-1} v_i \alpha^{ij} \Big)^p = \sum_{i=0}^{d-1} v_i^p \alpha^{ipj} .$$

Further, if $v_i$ is an element of GF($p$) for all $i$, then $v_i^p = v_i$. Consequently this gives

$$V_j^p = \sum_{i=0}^{d-1} v_i \alpha^{i(pj)} = V_{(pj) \bmod d} .$$

Conversely, suppose that for all $j$, $V_j^p = V_{(pj) \bmod d}$. Then

$$\sum_{i=0}^{d-1} v_i^p \alpha^{ipj} = \sum_{i=0}^{d-1} v_i \alpha^{i(pj)} ,$$

where $j = 0, 1, \ldots, d-1$. Let $k = pj$. Because $p$ is relatively prime to $d$, as $j$ ranges over all values between 0 and $d-1$, $k$ (taken mod $d$) also takes on all values between 0 and $d-1$. Hence

$$\sum_{i=0}^{d-1} v_i^p \alpha^{ik} = \sum_{i=0}^{d-1} v_i \alpha^{ik} , \tag{4.61}$$

where $k = 0, 1, \ldots, d-1$. By uniqueness of the Fourier transform, $v_i^p = v_i$ for all $i$. Thus $v_i$ is a zero of $x^p - x$ for all $i$, and such zeros are all elements of GF($p$). ♦

This theorem asserts that if the time-domain signal $v$ is in GF($p$), then the value of the spectrum $V_j$ in the cyclotomic coset $C$ specifies the value of the spectrum at all other frequencies (or elements) in $C$.
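The conjugacy constraint of Theorem 4.1 can be observed numerically for $p = 2$, $m = 4$, $d = 15$. The sketch below assumes GF($2^4$) is built as $Z_2[x]/(x^4 + x^3 + 1)$ with $\alpha = x$, and stores elements as 4-bit integers (bit $k$ is the coefficient of $x^k$). The helper names are illustrative.

```python
MOD_POLY = 0b11001  # x^4 + x^3 + 1
ALPHA = 0b0010      # the element x


def gf16_mul(a, b):
    """Shift-and-add multiplication in GF(2^4) modulo x^4 + x^3 + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= MOD_POLY
    return result


def gf16_pow(a, e):
    """a**e by repeated multiplication (e is small here)."""
    result = 1
    for _ in range(e):
        result = gf16_mul(result, a)
    return result


def transform(v):
    """V_j = sum_i v_i * alpha^(i*j) for a binary vector v of length 15."""
    d = len(v)
    spectrum = []
    for j in range(d):
        acc = 0
        for i, vi in enumerate(v):
            if vi:  # v_i is 0 or 1, so each term is either absent or alpha^(ij)
                acc ^= gf16_pow(ALPHA, (i * j) % d)
        spectrum.append(acc)
    return spectrum


if __name__ == "__main__":
    v = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1]
    V = transform(v)
    # Squaring a spectral value gives the value at index 2j mod 15.
    print(all(gf16_mul(V[j], V[j]) == V[(2 * j) % 15] for j in range(15)))
```

Only one spectral value per cyclotomic coset therefore needs to be computed directly; the others follow by squaring.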


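Normal Basis Applications. For $V \in$ GF($p^m$), let $N_n = \{ \alpha, \alpha^{p}, \alpha^{p^2}, \ldots, \alpha^{p^{m-1}} \}$ be a normal basis. Thus,

$$V = v_0 \alpha + v_1 \alpha^{p} + v_2 \alpha^{p^2} + \ldots + v_{m-1} \alpha^{p^{m-1}}$$

where $v_i \in$ GF($p$), $i = 0, 1, \ldots, m-1$. According to Theorem 2.3,

$$\begin{aligned}
V^p &= v_0^p \alpha^{p} + v_1^p \alpha^{p^2} + v_2^p \alpha^{p^3} + \ldots + v_{m-1}^p \alpha^{p^m} \\
&= v_{m-1} \alpha + v_0 \alpha^{p} + v_1 \alpha^{p^2} + v_2 \alpha^{p^3} + \ldots + v_{m-2} \alpha^{p^{m-1}} .
\end{aligned}$$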

Cyclic Shift Property. From the normal basis property shown above, the operation of taking successive $p$th powers of $V$ is equivalent to successive cyclic shifts. A pictorial presentation of the evaluation of $V^p$ is shown in Figure 4.3.

Example 4.6 Consider GF($2^4$) $\cong Z_2[x]/(x^4 + x^3 + 1)$. Two normal bases are illustrated: $N_{n1} = \{ \alpha, \alpha^2, \alpha^4, \alpha^8 \} = \{ x, x^2, x^3 + 1, x^3 + x^2 + x \}$, which is called a primitive normal basis due to the fact that $x$ is a primitive element of the field, and $N_{n2} = \{ \alpha^3, \alpha^6, \alpha^{12}, \alpha^9 \} = \{ x^3, x^3 + x^2 + x + 1, x + 1, x^2 + 1 \}$. Field elements are represented under the various bases in Table 4.3. To show the cyclic shift property, let $V = \alpha^6$; then $V^2 = \alpha^{12}$, $V^4 = \alpha^{24} = \alpha^9$, and $V^8 = \alpha^{48} = \alpha^3$. From Table 4.3, the primitive normal basis representations of $V$, $V^2$, $V^4$, $V^8$ are (1, 1, 1, 0), (0, 1, 1, 1), (1, 0, 1, 1), and (1, 1, 0, 1), respectively. Similarly, the normal basis representations of $V$, $V^2$, $V^4$, $V^8$ are (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), and (1, 0, 0, 0), respectively. These show the property. For both normal bases, the element 1 has basis representation (1, 1, 1, 1). ♦

Figure 4.3 The Cyclic Shift Property of Normal Basis Representation of V
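The cyclic shift property can be confirmed for every element of GF($2^4$) with a brute-force coordinate search. This is only a sketch: it assumes the field is $Z_2[x]/(x^4 + x^3 + 1)$ with elements stored as 4-bit integers, and the function names are hypothetical.

```python
from itertools import product

MOD_POLY = 0b11001  # x^4 + x^3 + 1


def gf16_mul(a, b):
    """Shift-and-add multiplication in GF(2^4) modulo x^4 + x^3 + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0b10000:
            a ^= MOD_POLY
    return result


def normal_basis(beta):
    """Return [beta, beta^2, beta^4, beta^8]."""
    basis = [beta]
    for _ in range(3):
        basis.append(gf16_mul(basis[-1], basis[-1]))
    return basis


def coordinates(element, basis):
    """Find the GF(2) coordinates of element in the given basis by search."""
    for coords in product((0, 1), repeat=4):
        acc = 0
        for c, b in zip(coords, basis):
            if c:
                acc ^= b
        if acc == element:
            return coords
    raise ValueError("the given set does not span the element")


if __name__ == "__main__":
    basis = normal_basis(0b0010)          # primitive normal basis from alpha = x
    V = 0b1111                            # alpha^6 = x^3 + x^2 + x + 1
    for _ in range(4):
        print(coordinates(V, basis))
        V = gf16_mul(V, V)                # squaring shifts the coordinates
```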




Table 4.3  Field Elements of GF(2^4) Represented Under Primal, Primitive Normal, and Normal Bases

Element                          Primal basis        Primitive Normal Basis   Normal Basis
                                 (a0, a1, a2, a3)    (p0, p1, p2, p3)         (n0, n1, n2, n3)

α^(-∞) = 0                       (0, 0, 0, 0)        (0, 0, 0, 0)             (0, 0, 0, 0)
α^0    = 1                       (1, 0, 0, 0)        (1, 1, 1, 1)             (1, 1, 1, 1)
α      = x                       (0, 1, 0, 0)        (1, 0, 0, 0)             (1, 1, 0, 1)
α^2    = x^2                     (0, 0, 1, 0)        (0, 1, 0, 0)             (1, 1, 1, 0)
α^3    = x^3                     (0, 0, 0, 1)        (1, 1, 0, 1)             (1, 0, 0, 0)
α^4    = x^3 + 1                 (1, 0, 0, 1)        (0, 0, 1, 0)             (0, 1, 1, 1)
α^5    = x^3 + x + 1             (1, 1, 0, 1)        (1, 0, 1, 0)             (1, 0, 1, 0)
α^6    = x^3 + x^2 + x + 1       (1, 1, 1, 1)        (1, 1, 1, 0)             (0, 1, 0, 0)
α^7    = x^2 + x + 1             (1, 1, 1, 0)        (0, 0, 1, 1)             (1, 1, 0, 0)
α^8    = x^3 + x^2 + x           (0, 1, 1, 1)        (0, 0, 0, 1)             (1, 0, 1, 1)
α^9    = x^2 + 1                 (1, 0, 1, 0)        (1, 0, 1, 1)             (0, 0, 0, 1)
α^10   = x^3 + x                 (0, 1, 0, 1)        (0, 1, 0, 1)             (0, 1, 0, 1)
α^11   = x^3 + x^2 + 1           (1, 0, 1, 1)        (0, 1, 1, 0)             (1, 0, 0, 1)
α^12   = x + 1                   (1, 1, 0, 0)        (0, 1, 1, 1)             (0, 0, 1, 0)
α^13   = x^2 + x                 (0, 1, 1, 0)        (1, 1, 0, 0)             (0, 0, 1, 1)
α^14   = x^3 + x^2               (0, 0, 1, 1)        (1, 0, 0, 1)             (0, 1, 1, 0)


CHAPTER  5 

FAST  DISCRETE  FOURIER  TRANSFORMS  OVER  FINITE  FIELDS 

5.1  Introduction 

The  transform  with  the  cyclic  convolution  property  (CCP)  over  the  extension  field 
GF(  pm ) is  shown  to  be  an  effective  application  for  digital  signal  processing.  In  this  chapter 
the  Good-Thomas  prime-factor  FFT  algorithm  is  applied  over  the  Galois  field  GF(  pm ).  The 
intermediate  results  are  represented  by  using  a normal  basis  representation.  A significant 
computational  reduction  is  obtained  by  applying  a conjugacy  relation  to  a cyclotomic  coset 
of the intermediate variables, and by using the cyclic-shift property of $p$th powers of the variables within the normal basis representation. Once the VLSI architecture for the butterfly module of a cyclotomic coset based on the algorithm is developed, these module arrays are used to form the stages of the fast transform. For the case of $p = 2$, $m = 8$, performance analysis shows a six-fold reduction in computational complexity which is achieved in terms of an added hardware budget.

5.2  Fast  Prime-Factor  Finite  Field  Transform 

Blahut [Bla83] defined the discrete Fourier transforms over the Galois field GF($p^m$). We consider $v = \{ v_i : i = 0, 1, \ldots, d-1 \}$ a vector over GF($p$), where $d$ divides $p^m - 1$ for some $m$, and $r$ an element of GF($p^m$) of order $d$. The Galois field transform of the vector $v$ is $V = \{ V_j : j = 0, 1, \ldots, d-1 \}$, where $V_j \in$ GF($p^m$). The transform pair is given by

$$V_j = \sum_{i=0}^{d-1} r^{ij} v_i , \tag{5.1}$$

and

$$v_i = d^{-1} \sum_{j=0}^{d-1} r^{-ij} V_j , \tag{5.2}$$

where $d^{-1}$ denotes the inverse of $d$ and satisfies $d^{-1} d \equiv 1 \bmod p$.

This transform can be expedited by traditional FFT-type algorithms, but still has the computational complexity of $O(d \log d)$. This research introduces the concept of normal basis representation in GF($p^m$) and the conjugacy property of finite field elements. Application of these concepts to the existing prime-factor fast transform not only serves to simplify the transform, but also leads to the development of a very compact device. This paper demonstrates the forward transform. The inverse transform can also be processed based upon this concept.

The Good-Thomas prime-factor FFT algorithm is used to integrate a set of DFTs over a Galois field [Red86]. This algorithm relies upon index expansions based on the Chinese remainder theorem (CRT). The development of the normal basis Good-Thomas FFT over a Galois field is summarized as follows. Consider the case

$$d = \prod_{i=1}^{s} p_i , \qquad ( p_i, p_j ) = 1, \quad i \ne j ,$$

and let

$$d_k = \prod_{i=k}^{s} p_i , \qquad 1 \le k \le s .$$

There are $s$ stages in the fast transform, with the initial stage based on $p_s$ and the last stage based on $p_1$; intermediate results are represented by the symbol $X$.

At the first stage the scrambled input and output indices $i$ and $j$ are expanded as

$$i = i_1 (d_2 D_2) + i_2 (p_1 C_1) , \tag{5.3}$$

$$j = j_1 d_2 + j_2 p_1 , \tag{5.4}$$

where $i_1 = i \bmod p_1$, $i_2 = i \bmod d_2$, $j_1 = jD_2 \bmod p_1$, and $j_2 = jC_1 \bmod d_2$. Note that $C_1$ and $D_2$ are obtained by virtue of the fact that $p_1$ and $d_2$ are relatively prime and by using the Euclidean algorithm such that

$$1 = p_1 C_1 + d_2 D_2 . \tag{5.5}$$
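The index maps of Equations (5.3) to (5.5) can be exercised directly. The sketch below obtains $C_1$ and $D_2$ from the extended Euclidean algorithm and checks that both expansions reproduce the original index modulo $d$; the function names are assumptions made for illustration, and the particular Bezout coefficients returned are one valid choice among many.

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y


def first_stage_maps(p1, d2):
    """Build the split/join index maps of Equations (5.3) and (5.4)."""
    g, C1, D2 = ext_gcd(p1, d2)          # 1 = p1*C1 + d2*D2, Equation (5.5)
    if g != 1:
        raise ValueError("p1 and d2 must be relatively prime")
    d = p1 * d2

    def split_input(i):
        return i % p1, i % d2

    def join_input(i1, i2):
        return (i1 * d2 * D2 + i2 * p1 * C1) % d

    def split_output(j):
        return (j * D2) % p1, (j * C1) % d2

    def join_output(j1, j2):
        return (j1 * d2 + j2 * p1) % d

    return split_input, join_input, split_output, join_output


if __name__ == "__main__":
    split_in, join_in, split_out, join_out = first_stage_maps(3, 85)
    print(all(join_in(*split_in(i)) == i for i in range(255)))
    print(all(join_out(*split_out(j)) == j for j in range(255)))
```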

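The transform may be expressed in the following form

$$\begin{aligned}
V_j &= \sum_{i=0}^{d-1} r^{ij} v_i \\
&= \sum_{i_2=0}^{d_2-1} \sum_{i_1=0}^{p_1-1} r^{(i_1 d_2 D_2 + i_2 p_1 C_1)(j_1 d_2 + j_2 p_1)} v_{i_1, i_2} \\
&= \sum_{i_2=0}^{d_2-1} (r^{p_1})^{i_2 j_2 p_1 C_1} \Big[ \sum_{i_1=0}^{p_1-1} (r^{d_2})^{i_1 j_1 d_2 D_2} v_{i_1, i_2} \Big] \\
&= \sum_{i_2=0}^{d_2-1} (r^{p_1})^{i_2 j_2 p_1 C_1} X_{j_1, i_2} .
\end{aligned}$$

The cross terms vanish because $r^{p_1 d_2} = r^d = 1$. By applying the relation in Equation (5.5) to the above equation, we have

$$\begin{aligned}
(r^{p_1})^{i_2 j_2 p_1 C_1} &= (r^{p_1 (1 - d_2 D_2)})^{i_2 j_2} \\
&= (r^{p_1} r^{-p_1 d_2 D_2})^{i_2 j_2} \\
&= (r^{p_1})^{i_2 j_2} .
\end{aligned} \tag{5.6}$$

Thus, the transform becomes

$$V_j = \sum_{i_2=0}^{d_2-1} (r^{p_1})^{i_2 j_2} X_{j_1, i_2} . \tag{5.7}$$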


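The next stage expands the indices $i_2$ and $j_2$ by a procedure similar to that in Equations (5.3) and (5.4):

$$i_2 = i_2' (d_3 D_3) + i_3 (p_2 C_2) , \qquad j_2 = j_2' d_3 + j_3 p_2 , \tag{5.8}$$

where $i_2' = i \bmod p_2$, $i_3 = i \bmod d_3$, and

$$\begin{aligned}
j_2' &= j_2 D_3 \bmod p_2 \\
&= ((jC_1 \bmod d_2) D_3) \bmod p_2 \\
&= ((jC_1 \bmod d_2) \bmod p_2)(D_3 \bmod p_2) \\
&= (jC_1 \bmod p_2)(D_3 \bmod p_2) \\
&= jC_1 D_3 \bmod p_2 ,
\end{aligned} \tag{5.9}$$

and

$$\begin{aligned}
j_3 &= j_2 C_2 \bmod d_3 \\
&= ((jC_1 \bmod d_2) C_2) \bmod d_3 \\
&= jC_1 C_2 \bmod d_3 .
\end{aligned} \tag{5.10}$$

Similar to the initial stage, $p_2$ and $d_3$ are relatively prime such that

$$1 = p_2 C_2 + d_3 D_3 . \tag{5.11}$$

The output of this stage is given by employing $X_{j_1, i_2}$, the output of the first stage:

$$\begin{aligned}
V_j &= \sum_{i_2=0}^{d_2-1} (r^{p_1})^{i_2 j_2} X_{j_1, i_2} \\
&= \sum_{i_3=0}^{d_3-1} \sum_{i_2'=0}^{p_2-1} (r^{p_1})^{(i_2' d_3 D_3 + i_3 p_2 C_2)(j_2' d_3 + j_3 p_2)} X_{j_1, i_2} \\
&= \sum_{i_3=0}^{d_3-1} (r^{p_1 p_2})^{i_3 j_3 p_2 C_2} \Big[ \sum_{i_2'=0}^{p_2-1} (r^{p_1 d_3})^{i_2' j_2' d_3 D_3} X_{j_1, i_2} \Big] \\
&= \sum_{i_3=0}^{d_3-1} (r^{p_1 p_2})^{i_3 j_3 p_2 C_2} X_{j_1, j_2', i_3} .
\end{aligned}$$

Similar to Equation (5.6), we apply Equation (5.11), which yields

$$(r^{p_1 p_2})^{i_3 j_3 p_2 C_2} = (r^{p_1 p_2 (1 - d_3 D_3)})^{i_3 j_3} = (r^{p_1 p_2})^{i_3 j_3} .$$

Thus, the transform to this point becomes

$$V_j = \sum_{i_3=0}^{d_3-1} (r^{p_1 p_2})^{i_3 j_3} X_{j_1, j_2', i_3} . \tag{5.12}$$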


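The general stage in the Good-Thomas algorithm is expressible in terms of the following equations. At stage $k$, the scrambled input and output indices become

$$i_k = i_k' (d_{k+1} D_{k+1}) + i_{k+1} (p_k C_k) , \qquad j_k = j_k' d_{k+1} + j_{k+1} p_k , \tag{5.13}$$

with

$$j_k' = [\, j (C_1 C_2 \ldots C_{k-1}) D_{k+1} \,] \bmod p_k , \qquad j_{k+1} = j (C_1 C_2 \ldots C_k) \bmod d_{k+1} ,$$

where $C_k$ and $D_{k+1}$ satisfy the equation

$$1 = p_k C_k + d_{k+1} D_{k+1} . \tag{5.14}$$

The output of this stage is analogous to the procedures shown earlier, with $j_1' = j_1$:

$$\begin{aligned}
V_j &= \sum_{i_k=0}^{d_k-1} (r^{d/d_k})^{i_k j_k} X_{j_1, j_2', \ldots, j_{k-1}', i_k} \\
&= \sum_{i_{k+1}=0}^{d_{k+1}-1} \sum_{i_k'=0}^{p_k-1} (r^{d/d_k})^{(i_k' d_{k+1} D_{k+1} + i_{k+1} p_k C_k)(j_k' d_{k+1} + j_{k+1} p_k)} X_{j_1, j_2', \ldots, j_{k-1}', i_k} \\
&= \sum_{i_{k+1}=0}^{d_{k+1}-1} (r^{d/d_{k+1}})^{i_{k+1} j_{k+1} p_k C_k} \Big[ \sum_{i_k'=0}^{p_k-1} (r^{d/p_k})^{i_k' j_k' (d_{k+1} D_{k+1})} X_{j_1, j_2', \ldots, j_{k-1}', i_k} \Big] \\
&= \sum_{i_{k+1}=0}^{d_{k+1}-1} (r^{d/d_{k+1}})^{i_{k+1} j_{k+1}} X_{j_1, j_2', \ldots, j_k', i_{k+1}} ,
\end{aligned} \tag{5.15}$$

where the output index $j$ expanded to stage $k$ is defined as

$$\begin{aligned}
j &= j_1 d_2 + j_2' p_1 d_3 + j_3' p_1 p_2 d_4 + \ldots + j_k' p_1 p_2 \cdots p_{k-1} d_{k+1} + j_{k+1} p_1 p_2 \cdots p_k \\
&= [\, j_1 (d/p_1) + j_2' (d/p_2) + \ldots + j_k' (d/p_k) + j_{k+1} (d/d_{k+1}) \,] \bmod d .
\end{aligned} \tag{5.16}$$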
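The multi-stage index expansion of Equation (5.16) can be verified for the 255-point case. The sketch below applies Equations (5.13) and (5.14) stage by stage and recombines the residue components; `decompose` and `recombine` are hypothetical helper names, and the Bezout coefficients come from a standard extended Euclidean routine.

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = ext_gcd(b, a % b)
    return g, y, x - (a // b) * y


def decompose(j, factors):
    """Split j into [j_1', j_2', ..., j_(s-1)', j_s] for factors [p_1, ..., p_s]."""
    d_next = 1
    for f in factors:
        d_next *= f                       # starts as d_1 = d
    current = j % d_next                  # j_1 = j
    components = []
    for p_k in factors[:-1]:
        d_next //= p_k                    # d_(k+1)
        g, C_k, D_next = ext_gcd(p_k, d_next)
        components.append((current * D_next) % p_k)   # j_k'
        current = (current * C_k) % d_next            # j_(k+1)
    components.append(current)
    return components


def recombine(components, factors):
    """Evaluate the second line of Equation (5.16) with k = s - 1."""
    d = 1
    for f in factors:
        d *= f
    return sum(c * (d // f) for c, f in zip(components, factors)) % d


if __name__ == "__main__":
    factors = [3, 5, 17]
    print(all(recombine(decompose(j, factors), factors) == j for j in range(255)))
```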

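Since the intermediate results of the Good-Thomas transform are the elements of the finite field GF($p^m$), the conjugacy property is applicable and the transform process is simplified. Letting $t = 1, 2, \ldots$, the notation $R^t_{i,j}$ is defined as

$$R^t_{i,j} = j_i' p^t \bmod p_i , \qquad \text{for } i = 1, 2, \ldots, k , \tag{5.17}$$

with $j_1' = j_1$. If the output of the initial stage in Equation (5.7) is raised to the $p^t$th power,

$$\begin{aligned}
(X_{j_1, i_2})^{p^t} &= \Big( \sum_{i_1=0}^{p_1-1} (r^{d/p_1})^{i_1 j_1 (d_2 D_2)} v_{i_1, i_2} \Big)^{p^t} \\
&= \sum_{i_1=0}^{p_1-1} (r^{d/p_1})^{i_1 j_1 p^t (d_2 D_2)} v_{i_1, i_2} ,
\end{aligned} \tag{5.18}$$

then according to the finite field properties discussed in Chapter 3, and using Equation (5.17), this leads to the following general expression of the result of the initial stage

$$(X_{j_1, i_2})^{p^t} = \sum_{i_1=0}^{p_1-1} (r^{d/p_1})^{i_1 R^t_{1,j} (d_2 D_2)} v_{i_1, i_2} = X_{R^t_{1,j}, i_2} .$$

Similarly, the output of the second stage is obtained and shown as follows.

$$\begin{aligned}
(X_{j_1, j_2', i_3})^{p^t} &= \sum_{i_2'=0}^{p_2-1} (r^{d/p_2})^{i_2' j_2' p^t (d_3 D_3)} (X_{j_1, i_2})^{p^t} \\
&= \sum_{i_2'=0}^{p_2-1} (r^{d/p_2})^{i_2' R^t_{2,j} (d_3 D_3)} X_{R^t_{1,j}, i_2} \\
&= X_{R^t_{1,j}, R^t_{2,j}, i_3} .
\end{aligned} \tag{5.19}$$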



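The general form of the conjugacy property is demonstrated by raising to the $p^t$th power the defining terms in Equation (5.15) for the intermediate results from stage $k$:

$$\begin{aligned}
(X_{j_1, j_2', \ldots, j_k', i_{k+1}})^{p^t} &= \sum_{i_k'=0}^{p_k-1} (r^{d/p_k})^{i_k' j_k' p^t (d_{k+1} D_{k+1})} (X_{j_1, j_2', \ldots, j_{k-1}', i_k})^{p^t} \\
&= \sum_{i_k'=0}^{p_k-1} (r^{d/p_k})^{i_k' R^t_{k,j} (d_{k+1} D_{k+1})} X_{R^t_{1,j}, R^t_{2,j}, \ldots, R^t_{k-1,j}, i_k} \\
&= X_{R^t_{1,j}, R^t_{2,j}, \ldots, R^t_{k,j}, i_{k+1}} .
\end{aligned} \tag{5.20}$$

These intermediate results are carried through to the final output values, where the output $V_j$ is expressed by expanding $j$ according to Equation (5.16):

$$\begin{aligned}
(V_j)^{p^t} &= \sum_{i_s=0}^{p_s-1} (r^{d/p_s})^{i_s R^t_{s,j}} X_{R^t_{1,j}, R^t_{2,j}, \ldots, R^t_{s-1,j}, i_s} \\
&= V_{[ R^t_{1,j} (d/p_1) + R^t_{2,j} (d/p_2) + \ldots + R^t_{s,j} (d/p_s) ] \bmod d} .
\end{aligned} \tag{5.21}$$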


5.2.1  The Application of Normal Basis and Conjugacy Properties

MacWilliams and Sloane [Mac77] (also see Chapters 3 and 4) show that there always exists a normal basis in the finite field GF($p^m$) for all positive integers $m$. In other words, one can find a field element $\alpha$ such that $N_n = \{ \alpha, \alpha^{p}, \alpha^{p^2}, \ldots, \alpha^{p^{m-1}} \}$ is a basis set of GF($p^m$). Thus, the field element $V_j \in$ GF($p^m$) can be uniquely expressed in terms of the basis $N_n$ as

$$V_j = v_{j,0} \alpha + v_{j,1} \alpha^{p} + v_{j,2} \alpha^{p^2} + \ldots + v_{j,m-1} \alpha^{p^{m-1}} , \tag{5.22}$$

where $v_{j,i} \in$ GF($p$), $i = 0, 1, \ldots, m-1$. According to the binomial theorem and the properties of the finite field GF($p^m$), the $p$th power of $V_j$ becomes

$$V_j^p = v_{j,m-1} \alpha + v_{j,0} \alpha^{p} + v_{j,1} \alpha^{p^2} + \ldots + v_{j,m-2} \alpha^{p^{m-1}} . \tag{5.23}$$

Hence, $V_j^p$ is simply a cyclic shift of $V_j$ in the normal basis representation.




As stated in Chapter 4, elements of the field GF($p^m$) with the same minimal polynomial are called conjugates with respect to GF($p$). For $\alpha \in$ GF($p^m$), the $p$th powers of $\alpha$ fall into disjoint conjugate sets.

5.2.2  A Normal  Basis  Architecture  for  the  Finite  Field  Transform 

A modularized VLSI circuit is developed for the butterfly module of a cyclotomic coset in Figure 5.1. The module consists of $p_k - 1$ normal basis multipliers, a $p_k$-input digit-wise adder, a cyclic shift routing network, and $l$ $m$-digit registers. Generation of the $l$ intermediate variables occurs in the coset module, where only one variable needs to be evaluated. The operations include $p_k - 1$ multiplications and additions. The remaining $l - 1$ variables are obtained by cyclic shifting of the evaluated variable within a pipeline clock period. Note that in Figure 5.1 the factor associated with index $i_k'$ is

$$(r^{d/p_k})^{i_k' j_k' (d_{k+1} D_{k+1})} , \tag{5.24}$$

where $i_k' = 0, 1, \ldots, p_k - 1$, $k = 1, 2, \ldots, s-1$.

Figure 5.1  The VLSI Butterfly Module CB of a Cyclotomic Coset of Length l

These modules serve as basic elements in the hierarchical design of the Galois-field transform system shown in Figure 5.2, where the transform is based on $p_s$ in stage one, on $p_{s-1}$ in stage two, and on $p_1$ in the last stage $s$. At stage $k$ of the transform pipeline, the intermediate variables $X$ are grouped into $b_k$ blocks, each of size $z_k$, where

$$b_k = \prod_{r=1}^{s-k} p_r \tag{5.25}$$

for $s - k > 0$, and $b_k = 1$ otherwise;

$$z_k = \prod_{r=1}^{k} p_{s-k+r} . \tag{5.26}$$

Figure 5.2  The s Stages Fast Galois Field Transform System
r=l 


Example 5.1: The 255-Point Good-Thomas FFT.

Let GF($p^m$), with $p = 2$, $m = 8$, $d = 255 = p_1 \cdot p_2 \cdot p_3 = 3 \cdot 5 \cdot 17$. Stage one is based on 17; stage two is based on five; and stage three is based on three. Figure 5.3 shows the FFT pipeline stages.

In order to evaluate the operations required for the transform, the number and the length of the cosets for each stage are counted. At the output stage, $d = 255$, the divisors of $d$ are 255, 85, 51, 17, 15, 5, 3, and 1. Since the numbers 255, 85, 51, 17 are all divisors of $2^8 - 1$, there exist 30 cosets of length eight ($\phi(255)/8 + \phi(85)/8 + \phi(51)/8 + \phi(17)/8 = 30$). Similarly, 15 and five divide $2^4 - 1$, with $\phi(15) = 8$ and $\phi(5) = 4$, which yields three cosets of length four. The remaining divisors three and one yield one coset each, of length two and one, respectively. Hence, the total number of cosets in the stage is 35. At stage two, the intermediate variables are grouped into three blocks of size 85. Each of the 85-point blocks has 10 cosets of length eight, and one each of length one and four. Hence, the total number of cosets per block in the stage is 12. At stage one, there are 15 blocks of size 17; each block has two cosets of length eight and one of length one. Thus, the total number of cosets per block in the stage is 3. Using the properties of the cyclotomic coset and the normal basis representation, the complexity analysis between conventional and normal basis approaches, in terms of the number of operations (multiplications and additions involved), is given in Table 5.1. ♦


Table 5.1  The Complexity Comparison Between Conventional and Normal Basis Approaches

Stage     Conventional                  Normal basis

1         15 · 17 · (17-1) = 4080       15 · 3 · (17-1) = 720
2         3 · 85 · (5-1) = 1020         3 · 12 · (5-1) = 144
3         1 · 255 · (3-1) = 510         1 · 35 · (3-1) = 70

Total     5610                          934


5.2.3  System  Performance  Analysis  and  Discussions 

A comparison of the total number of operations involved for each method (Table 5.1) with the direct DFT, which requires approximately 65,000 operations, clearly suggests a dramatic reduction in computational operations when the normal basis system is implemented. This system takes advantage of the multiplication-free shifting by applying the normal basis representation to the variables in the pipeline stages. In terms of physical realities, the simplified computational structure makes this system readily adaptable to a compact device without relinquishing any of its efficiency. Interfacing with pre-existing systems which have representations in a different format, such as the power form or the standard polynomial form, simply involves making a basis change at the end of the processing pipeline.




For the implementation of the prime factor FFT over the finite field GF($p^m$), $p$ and $m$ have to be properly chosen to meet the number of input data samples and their range. There are some general guidelines to assist the development of an efficient system.

1) The dynamic range of the data sequences to be transformed is $p$, which is the characteristic of the finite field GF($p^m$), according to Blahut's theory. For example, for an 8-bit system, $p$ should be a prime < 256; for a 16-bit system, $p$ should be a prime < 65,536;

2) The length of the data sequences $d$ is a factor of $p^m - 1$;

3) To be suitable for applying FFT algorithms, the length $d$ should be highly composite;

4) To effectively utilize the conjugacy property of field elements, we should properly choose $d$ and its factors such that fewer cosets with larger length are obtained;

5) The possible lengths of cosets are the factors of $m$. For example, for GF($2^8$) the possible lengths of the cosets are 8, 4, 2, 1. Similarly, the possible lengths in GF($2^5$) are 5 and 1. Thus, it is better to have larger $m$ (or even prime $m$);

6) Redinbo [Red86] suggested that the order of stages in the prime factor FFT is so essential that it will determine the complexity of the system. For example, for GF($2^8$), let $d = 255$; the factors of $d$ are 3, 5, 17. For the case of $p_1 = 3$, $p_2 = 5$, $p_3 = 17$, the operations needed for each stage are:

Stage 1 ($p_3$): 15 · 3 · (17 - 1) = 720

Stage 2 ($p_2$): 3 · 12 · (5 - 1) = 144

Stage 3 ($p_1$): 1 · 35 · (3 - 1) = 70

Total operations: 934.

However, for $p_1 = 17$, $p_2 = 5$, $p_3 = 3$, the operations needed for each stage are:

Stage 1 ($p_3$): 85 · 2 · (3 - 1) = 340

Stage 2 ($p_2$): 17 · 5 · (5 - 1) = 340

Stage 3 ($p_1$): 1 · 35 · (17 - 1) = 560.

Total operations: 1240;

7) It would be desirable to have all intermediate results in the transform always be in normal basis representation. However, the finite field multiplication of powers of $r$ and field elements during the evaluation of intermediate variables results in non-normal basis representations. This leads to some extra work to convert back to normal basis. To solve this problem, either the powers of $r$ have to be represented in normal basis and normal-basis multipliers applied, or basis conversions have to be applied at the end of the evaluation. The advantage of the normal basis multiplier is its simplicity and regularity, but it might take a lot of space to implement the multipliers for the evaluation of intermediate variables. On the other hand, digit shifts and digit-wise additions can simply be performed first, after which a conversion from primal basis to normal basis is applied. This conversion can be drastically simplified if a self-dual normal basis to represent field elements can be found and used [Lid86].
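The operation counts quoted in guideline 6) and in Table 5.1 can be reproduced with a short script. This is a sketch under an assumption inferred from the table entries rather than stated as a formula in the text: each stage costs (number of blocks) × (variables evaluated per block) × (radix − 1), where the conventional method evaluates every variable in a block and the normal basis method evaluates one variable per cyclotomic coset. The function names are illustrative.

```python
def count_cosets(p, n):
    """Number of cyclotomic cosets of {0, ..., n-1} under s -> s*p mod n.

    Assumes gcd(p, n) == 1 so that every orbit is a cycle.
    """
    seen, count = set(), 0
    for s in range(n):
        if s not in seen:
            count += 1
            t = s
            while t not in seen:
                seen.add(t)
                t = (t * p) % n
    return count


def stage_operations(p, factors):
    """Per-stage (stage, conventional, normal_basis) operation counts.

    factors = [p_1, ..., p_s]; stage k is based on p_(s-k+1).
    """
    s = len(factors)
    rows = []
    for k in range(1, s + 1):
        blocks = 1
        for f in factors[: s - k]:
            blocks *= f                   # b_k, Equation (5.25)
        size = 1
        for f in factors[s - k:]:
            size *= f                     # z_k, Equation (5.26)
        radix = factors[s - k]
        conventional = blocks * size * (radix - 1)
        normal_basis = blocks * count_cosets(p, size) * (radix - 1)
        rows.append((k, conventional, normal_basis))
    return rows


if __name__ == "__main__":
    for order in ([3, 5, 17], [17, 5, 3]):
        rows = stage_operations(2, order)
        print(order, rows, sum(r[2] for r in rows))
```

With the factors ordered as (3, 5, 17) the normal basis total is 934, and with (17, 5, 3) it is 1240, matching the figures in guideline 6).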

5.3  The  Basis-change  Algorithm  for  Fast  Finite  Field  Transforms 

As fast Fourier transform (FFT) algorithms are applied over the finite field GF($p^m$), the intermediate results are encoded in a normal basis representation. A significant computational reduction is obtained due to the conjugacy relationship within a cyclotomic coset of the intermediate variables, and the cyclic shifting property of $p$th powers of the variables in a normal basis representation. However, the computation of each element within the conjugate cosets is still an intensive multiplicative task. To mitigate the problem, a fast and compact computational structure based on the basis-change algorithm is used to perform the multiply-accumulate operations over a finite field. Once the butterfly operation module of a cyclotomic coset based on the algorithm is developed, the integration of the modules leads to an efficient VLSI implementation of the pipeline stages of the fast transform.

While this transform can be expedited by traditional FFT-type algorithms [McC79], its computational complexity remains $O(d \log d)$. The concept of applying the conjugacy property and the normal basis representation of finite field elements to the transform to reduce its complexity was introduced by Kao and Taylor [Kao90] and Redinbo [Red86]. However, under normal basis representations, the evaluation of a field element of a cyclotomic coset within finite field transforms introduces a nontrivial multiply-accumulate task which may result in an extensive application of the conventional finite field multipliers (see Chapter 3).

This research effort suggests a new computational algorithm which involves the
change of basis of field elements during the nontrivial evaluation of the coset elements. This
finite field arithmetic is equipped with cyclic shifting for scaling, simple table lookups for
polynomial reduction, and a series of component-wise modular adders. Applying these
concepts to the existing fast transform not only simplifies the transform, but also
leads to a very compact signal processing device which can be implemented as a building
block and seamlessly integrated into a discrete Fourier transform (DFT) system.

5.3.1  The  Conjugate  Sets  in  a Fast  Finite  Field  Transform 

The general form of a factor-type fast finite field transform V_j, based on either
the Cooley-Tukey or the Good-Thomas algorithm, at stage k can be expressed as

\[
V_j = \sum_{i_{k+1}=0}^{d_{k+1}-1} \left[ \sum_{i_k=0}^{P_k-1} X_{j_1 j_2 \cdots j_{k-1},\, i_k}\, r^{c_k} \right] r^{c_{k+1}}
    = \sum_{i_{k+1}=0}^{d_{k+1}-1} X_{j_1 j_2 \cdots j_k,\, i_{k+1}}\, r^{c_{k+1}}, \tag{5.27}
\]

where c_k and c_{k+1} are functions of i and j, P_k is a factor of d, and X is an intermediate
result of the transform pipeline.

Since the intermediate results of the fast transform are elements of the finite field
GF(p^m), the conjugacy property is applicable and the transform process is simplified. The
conjugacy property is demonstrated by raising the defining terms X_{j_1 j_2 ... j_k, i_{k+1}}
in Equation (5.27) for the intermediate results from stage k to the p^t-th power:

\[
\left( X_{j_1 j_2 \cdots j_k,\, i_{k+1}} \right)^{p^t}
  = \left[ \sum_{i_k=0}^{P_k-1} X_{j_1 j_2 \cdots j_{k-1},\, i_k}\, r^{c_k} \right]^{p^t}
  = \sum_{i_k=0}^{P_k-1} \left( X_{j_1 j_2 \cdots j_{k-1},\, i_k} \right)^{p^t} r^{p^t c_k}
  = X_{R_1(p^t j_1)\, R_2(p^t j_2) \cdots R_k(p^t j_k),\, i_{k+1}}, \tag{5.28}
\]

where t = 1, 2, ..., l, and l is the coset length.

A modular circuit for the evaluation of a conjugate set is developed accordingly
and shown in Figure 5.4. The l intermediate variables are generated in the evaluation
module, where only one variable needs to be evaluated. The operations include P_k - 1
modular multiplications and additions, which are performed in the multiply-accumulate
unit (MAU). The remaining l - 1 variables are obtained by simultaneous cyclic
shifts of the evaluated variable. The design consists of P_k - 1 normal basis multipliers, a
P_k-input m-digit component-wise modular adder, a cyclic shift routing network, and l
m-digit registers for result storage.
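
The cyclic shifting property that this circuit relies on can be checked numerically. The following is a minimal sketch, assuming p = 2 and the field GF(2^4) with defining polynomial x^4 + x^3 + 1 (the field used later in Example 5.2); the function names are illustrative and do not come from the dissertation. It verifies that raising an element to the power 2^t equals a t-digit cyclic shift of its normal basis coefficients.

```python
# GF(2^4) with defining polynomial f(x) = x^4 + x^3 + 1 (an assumed example field).
# A field element is a 4-bit integer: bit j is the coefficient of r^j (primal basis).
M_DEG = 4
POLY = 0b11001  # x^4 + x^3 + 1


def gf_mul(a, b):
    """Multiply two field elements by shift-and-add with polynomial reduction."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M_DEG):
            a ^= POLY
    return result


def gf_pow(a, e):
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


R = 0b0010                                               # the root r
NORMAL_BASIS = [gf_pow(R, 2 ** j) for j in range(M_DEG)]  # r, r^2, r^4, r^8


def from_normal(coeffs):
    """Convert normal basis coefficients (a0, a1, a2, a3) to the primal-basis integer."""
    value = 0
    for c, beta in zip(coeffs, NORMAL_BASIS):
        if c:
            value ^= beta
    return value


def conjugates(coeffs):
    """Return the cyclic shifts of a coefficient vector: shift t represents x^(2^t)."""
    shifts, current = [], list(coeffs)
    for _ in range(M_DEG):
        shifts.append(tuple(current))
        current = [current[-1]] + current[:-1]  # rotate right by one digit
    return shifts
```

For instance, `conjugates((1, 1, 0, 1))` lists the normal basis forms of r^3, r^6, r^12, and r^24 = r^9 without any field multiplication.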

The fast transform algorithm dramatically reduces computational complexity. However,
the evaluation of the leading element in a coset is still a demanding multiplicative task which
involves normal basis multiplications of a previous-stage variable X and a power r^{ig} of a
primitive element, as shown in Equation (5.27).


5.3.2  The  Basis-Change  Algorithm 


To  maintain  the  inherent  advantage  of  the  cyclic  shifting  within  a normal  basis  scheme, 
the  intermediate  variable  evaluated  in  Equation  (5.27)  must  remain  in  the  normal  basis  form. 
However,  the  result  of  the  multiply-accumulate  operations  always  deviates  from  this  rule. 


Figure 5.4  The Conjugate Set Evaluation Circuit


The variables X_{i,k-1} in normal basis representation from stage k-1, multiplied by
r^{ig}, are represented explicitly as

\[
X_{i,k-1}\, r^{ig} = \left( a_{i,0}\, r + a_{i,1}\, r^{p} + a_{i,2}\, r^{p^2} + \cdots + a_{i,m-1}\, r^{p^{m-1}} \right) r^{ig}, \tag{5.29}
\]

where i = 0, 1, ..., P_k - 1. The variable X_{i,k} of stage k, which is the sum of Equation (5.29)
over all i, becomes an "unreduced" polynomial of possible degree n = p^{m-1} + (P_k - 1)g and is
represented as

\[
X_{i,k} = c_0\, r^{q_0} + c_1\, r^{q_1} + \cdots + c_n\, r^{q_n}, \tag{5.30}
\]

where c_s is a function of a_{i,j} for s = 0, 1, ..., n and q_t is a function of (p^e + ig) for
t = 0, 1, ..., n. Obviously, Equation (5.30) is not in the normal basis form from which the
remaining variables in the same coset can be obtained by the cyclic shift property. As a result,
a novel algorithm capable of performing the basis-change process without normal basis
multiplications is presented as follows:

1)  Reduce the exponent of r^{q_t}, where q_t > p^m - 1, by performing modulo (p^m - 1)
in accordance with Fermat's theorem, which states r^{p^m - 1} = 1;

2)  Reduce the exponent of r^{q_t}, where q_t > m - 1, by expressing r^{q_t} in terms of
polynomials modulo the field-defining polynomial for GF(p^m). This may be carried out simply
by an exponent table lookup operation;

3)  Rearrange the result of step 2), which may involve trivial multiplications and
additions within the ground field GF(p), to form the primal basis representation of the
variable X_{i,k};

4)  Build lookup tables of the normal basis representations for each component of
the primal basis vector [1, r, r^2, ..., r^{m-1}]; and

5)  Convert the primal basis variable in step 3) to the normal basis form with the
assistance of the lookup table in step 4). These operations involve only additions within the
ground field GF(p).

5.3.3  System  Development 


The MAU subsystem in Figure 5.4 is further developed according to the discussion in
the preceding section. The architecture of the MAU device is subdivided into four pipelined
processing elements: the Scaling Shifter, the Vector Combiner, the Primal Basis Generator,
and the Normal Basis Generator. They are described as follows:

1)  Scaling Shifter: the multiplication of an intermediate variable X_{i,k-1} and the factor
r^{ig} as found in Equation (5.29) is equivalent to a feedback shifting implementation. This is a
direct consequence of the fact that r^{p^m - 1} = 1. Figure 5.5 provides a pictorial demonstration
of the scaling shifting operation where r^{p^{m-1} + ig} = r^1 for the case of p^{m-1} + ig = p^m;


Figure 5.5  An Example of the Scaling Shifter


2)  Vector Combiner: the outputs of the Scaling Shifters are all represented as vectors over
the basis [1, r, r^2, ..., r^{p^m - 2}]. This means that the sum of these vectors may
involve component-wise addition in the field GF(p). A symbolic representation of the vector
combining unit is depicted in Figure 5.6;


Figure  5.6  An  Example  of  the  Vector  Combiner 


3)  Primal Basis Generator: the (p^m - 1)-digit output of the vector combiner shown in
Figure 5.6 needs to be changed to a regular primal basis of (m - 1)-digit. A preprogrammed
lookup table, the primal lookup table (PLUT), is built for polynomial reduction of the digits
in the vector (polynomial) whose corresponding exponent is greater than m - 1. The table is
computed using a discrete exponentiation method where the exponent i is presented to the
address of the table and points accordingly to the content of the memory. The relationship of
an exponent i and its corresponding primal basis form is expressed as

\[ \exp_r(i) = r^{i} = d_0 + d_1 r + d_2 r^2 + \cdots + d_{m-1} r^{m-1}. \]

A primal basis representation of the variable is obtained by table lookups and additions
over GF(p). A functional block is shown in Figure 5.7; and


Figure 5.7  The Primal Basis Generator

4)  Normal Basis Generator: normal basis representations of the powers of r are
precomputed, then stored in a lookup table called the normal lookup table (NLUT). The
address, j, and the content of the table memory have the relation

\[ r^{j} = d'_0\, r + d'_1\, r^{p} + d'_2\, r^{p^2} + \cdots + d'_{m-1}\, r^{p^{m-1}}. \]

An architecture similar to the primal basis generator is developed for the normal basis
conversion and is shown in Figure 5.8.

5.3.4  System  Complexity  Evaluation  and  Example 

The system complexity, in terms of hardware implementation cost, for the four processing
elements discussed in the previous section is summarized as follows:

Figure 5.8  The Normal Basis Generator

1)  Scaling Shifter: although it is a multiplication of an element of the normal basis and a
factor r^{ig}, the primary operation is simply a matter of exponent arithmetic. Hence, m
modulo-(p^m - 1) adders are required for each of the (P_k - 1) channels, where the processing
latency is the delay of one adder;

2)  Vector Combiner: this operation is equivalent to the addition of (P_k - 1) vectors of
basis length p^m - 1. The worst-case data path requires (P_k - 2) modulo-p adders with the
latency translated to ⌈log_2(P_k - 1)⌉ levels of the adder operation (Note: ⌈x⌉ represents the
smallest integer ≥ x, x ∈ R);

3)  Primal Basis Generator: the conversion of the digits within the vector to the primal
basis requires at most (P_k - 1)m memory tables with the configuration of p by m⌈log_2 p⌉. The
worst-case data path requires m channels of addition which, in turn, require (P_k - 1)m - 1
modulo-p adders with the latency of ⌈log_2[(P_k - 1)m]⌉ levels of the adder operation; and

4)  Normal Basis Generator: similarly, the conversion of the digits in a primal basis
vector to the normal basis requires m memory tables with the configuration of p by m⌈log_2 p⌉.
The worst-case data path requires m channels of addition, which require (m - 1) modulo-p
adders with the latency of ⌈log_2(m - 1)⌉ levels of the adder operation.

Example 5.2: The Basis-Change algorithm.

Let the finite field GF(p^m), where p = 2 and m = 4, have the defining polynomial f(x) =
x^4 + x^3 + 1. The set of roots {r, r^2, r^4, r^8} constitutes a normal basis of GF(2^4). Assume
that in one stage of a factor-type transform P_k = 3 and r^{ig} = r^{5i} for i = 0, 1, 2, and that
the set of previous-stage variables in the normal basis representation is
{X_{0,k}, X_{1,k}, X_{2,k}} = {r, r^3, r^7} (power form) = {(1000), (1101), (0011)} (normal basis).
Referring to Equation (5.27), evaluating the leading element X_{0,k+1} of a conjugate coset
amounts to evaluating the following sum of products in the MAU:

\[ X_{0,k+1} = \sum_{i=0}^{2} X_{i,k}\, r^{5i} = X_{0,k} + X_{1,k}\, r^{5} + X_{2,k}\, r^{10}. \]

Applying the basis-change algorithm:

Step 1. (Scaling shifting):

\[ X_{0,k}\, r^{0} = r; \]
\[ X_{1,k}\, r^{5} = r^{3} r^{5} = r^{8} = r^{6} + r^{7} + r^{13}; \]
\[ X_{2,k}\, r^{10} = r^{7} r^{10} = r^{17} = r^{14} + r^{18} \rightarrow r^{14} + r^{3} \quad (18 \bmod 15 = 3); \]

Step 2. (Vector combining):

\[ X_{0,k} + X_{1,k}\, r^{5} + X_{2,k}\, r^{10} = r + r^{3} + r^{6} + r^{7} + r^{13} + r^{14}; \]

Step 3. (Primal basis generation):

\[ X_{0,k+1} = r + r^{3} + (r^{3} + r^{2} + r + 1) + (r^{2} + r + 1) + (r^{2} + r) + (r^{3} + r^{2}) = r^{3}; \text{ and} \]

Step 4. (Normal basis generation):

\[ X_{0,k+1} = r^{3} \ \text{(primal basis)} \rightarrow (1101) \ \text{(normal basis)}. \]

The operations needed in the algorithm are five modulo-15 additions in Step 1; no
operation in Step 2; four PLUT lookups (for r^6, r^7, r^13, and r^14) and six modulo-2 additions
in Step 3; and one NLUT lookup in Step 4. ♦
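
The steps of Example 5.2 can be reproduced in software. The following is a minimal sketch, assuming p = 2 so that ground-field addition is a bitwise XOR; the table layout and the names `PLUT`, `NLUT`, and `basis_change_mac` are illustrative choices, not the hardware organization described above. The NLUT is built by brute-force enumeration purely for brevity.

```python
M_DEG = 4                        # m
POLY = 0b11001                   # f(x) = x^4 + x^3 + 1
ORDER = 2 ** M_DEG - 1           # p^m - 1 = 15
NORMAL_EXPONENTS = [1, 2, 4, 8]  # normal basis r, r^2, r^4, r^8


def build_plut():
    """PLUT[e] is the primal-basis form of r^e (bit j = coefficient of r^j)."""
    table, value = [], 1
    for _ in range(ORDER):
        table.append(value)
        value <<= 1
        if value & (1 << M_DEG):
            value ^= POLY
    return table


PLUT = build_plut()


def build_nlut():
    """NLUT[j] holds the normal basis coefficients of the primal component r^j."""
    inverse = {}
    for code in range(2 ** M_DEG):
        coeffs = tuple((code >> j) & 1 for j in range(M_DEG))
        primal = 0
        for c, e in zip(coeffs, NORMAL_EXPONENTS):
            if c:
                primal ^= PLUT[e]
        inverse[primal] = coeffs
    return [inverse[1 << j] for j in range(M_DEG)]


NLUT = build_nlut()


def basis_change_mac(xs_normal, g):
    """Evaluate sum_i X_i * r^(i*g) from normal basis inputs; return (primal, normal)."""
    # Scaling shifter: multiply by r^(i*g) through exponent addition modulo p^m - 1.
    exponents = [(e + i * g) % ORDER
                 for i, coeffs in enumerate(xs_normal)
                 for c, e in zip(coeffs, NORMAL_EXPONENTS) if c]
    # Vector combiner and primal basis generator: PLUT lookups and additions mod 2.
    primal = 0
    for e in exponents:
        primal ^= PLUT[e]
    # Normal basis generator: add the NLUT rows selected by the primal digits.
    normal = [0] * M_DEG
    for j in range(M_DEG):
        if (primal >> j) & 1:
            normal = [a ^ b for a, b in zip(normal, NLUT[j])]
    return primal, tuple(normal)
```

Calling `basis_change_mac([(1, 0, 0, 0), (1, 1, 0, 1), (0, 0, 1, 1)], 5)` uses the inputs of Example 5.2 and yields the primal form of r^3 together with the normal basis vector (1101).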

This algorithm enables a simple and fast evaluation of a leading element by applying
exponent arithmetic to the field element scaling, simple table lookups, and a few modular
additions over the field GF(p). The conventional approach might involve linear feedback
shift register (LFSR) type finite field multipliers [Lid86, Chapter 6] with slow-clocking
latency or the bulky parallel normal basis multiplier [Wan85]. This module serves as a
fundamental building block, along with a routing network which accomplishes barrel cyclic
shiftings for the remaining elements in a conjugate coset, and is repeatedly integrated into each
intermediate stage of a factor-type FFT system. Consequently, this module is an economical,
high-performance VLSI component.


CHAPTER  6 

THE  PIPELINE  POLYNOMIAL  RNS  PROCESSOR 
WITH  FERMAT  NUMBER  TRANSFORM 

The application of Polynomial Residue Number Systems (PRNSs) in complex
multiplication offers tremendously low complexity within the digital signal processing (DSP)
area. Unfortunately, isomorphic mappings between the complex number and PRNS domains
suffer from a nontrivial transform problem which eventually precludes the inherent
advantages of the PRNS approach. A significant simplification in the mapping procedure is
achieved by an FFT-like scheme and a sequence of primitive split-then-add operations. These
operations originate from an algebraic congruence and a residue reduction of a Fermat prime
within finite fields. An efficient custom VLSI implementation of the FFT-type
multiplier-free system confirms the advantages of the novel mapping algorithm.

The complex arithmetic of residue number systems in DSP applications is a recent
subject of intense study. Cozzens and Finkelstein [Coz85] suggest an improved algorithm for
complex number approximations using algebraic integers in higher degree extensions of the
rational numbers Q. The PRNS provides a special mechanism for complex number
operations [Ska87]. Although it suggests both performance and implementation advantages, the
isomorphic mapping structure between the complex domain and the PRNS domain results in an
awkward transform problem. This research effort demonstrates a solution to the mapping
procedure difficulties by applying the algebraic congruence concept and the residue
reduction technique.

Polynomial factorization over finite fields along with the polynomial version of the
Chinese remainder theorem (CRT) are the essential elements of PRNS system development.
The forward and inverse isomorphic mappings are shown to be parallel to the discrete
Fourier transform (DFT). Given an Nth-degree polynomial f(x) = x^N + 1, the equation f(x) = 0
is solved by applying the algebraic congruence of Fermat's theorem. Then, the primitive
roots r_i, where r_i^N ≡ -1 mod M and φ(M) = 2N, with φ being Euler's totient function, are found.
Consequently, an FFT-type fast mapping is obtained based upon the relationship of the primitive
roots. Since the basic operations in fast mapping are the scalings by the roots r_i, a novel
technique is applied to residue reduction which, in turn, simply translates the scaling
operations into trivial shift-then-adds. As a result, the two-level reduction in complexity
becomes a reinforcement of the PRNS system.

A prototype VLSI design of a five-bit multiplier-free FFT-type PRNS processor has
been implemented using the Magic IC layout tool. This design consists of 13K transistors
with a 6.1 by 6.0 square millimeter footprint. Complete logic and timing simulations of the
device were also carried out on the HP-DCS design tool. In terms of speed, cost, and
simplicity, the innovative approach of this new design outperforms the conventional systems
in current usage.

6.1.1  Reasons  To  Use  Fermat  Prime  Number  As  M 

Let N be the least positive integer such that α^N ≡ 1 mod M, and consider the following
cases for M.

1)  If M is even, that is, M has a factor of 2, the maximum value of N is 1. This
unpromising result implies that M should be odd.

2)  If M = 2^K - 1, with K a composite number pq, where p is a prime, then

\[ 2^{pq} - 1 = (2^{p} - 1)\left( 2^{p(q-1)} + 2^{p(q-2)} + \cdots + 2^{p} + 1 \right), \tag{6.1} \]

such that 2^p - 1 | M. This means the maximum possible length of the transform N will be
governed by the length possible for 2^p - 1.

3)  If M is a Mersenne number of the form 2^K - 1, where K is prime, Rader [Rad86a] showed
that transforms of length at least 2K exist and the corresponding α is -2. Because 2K is not
highly composite, there does not exist a fast FFT-type implementation algorithm.

4)  If M = 2^K + 1 and K is odd, then 3 divides M and the largest possible transform length N is 2.

5)  If M = 2^K + 1 and K is even, with K = m 2^n, where m is odd, then 2^{2^n} + 1 | M and the
length of the possible transform will be governed by the transform length possible for
2^{2^n} + 1. These numbers are known as Fermat numbers, which are opportunities for FFT-type
algorithms.

6.1.2  Algebraic  Congruence  to  a Fermat  Prime 

Let M be a Fermat prime of the form 2^m + 1, m = 2^n, n ≤ 4. According to
Fermat's theorem, x^{M-1} ≡ 1 mod M for all nonzero x ∈ Z_M, that is,

\[ x^{2^m} \equiv 1 \bmod M. \tag{6.2} \]

The nonzero elements in the group Z_M are roots of Equation (6.2). Thus, the group can be
generated by a primitive root α. Since the order of the group is φ(M) = 2^m, the set of the
roots of Equation (6.2) is constructed as

\[ R = \{ \alpha^{1}, \alpha^{2}, \ldots, \alpha^{2^m} = \alpha^{0} \}. \tag{6.3} \]

This means Equation (6.2) can be factored over the ring as

\[ x^{2^m} - 1 = \prod_{i=0}^{2^m - 1} ( x - \alpha^{i} ). \tag{6.4} \]

Equation (6.2) also indicates

\[ ( x^{2^m} )^{1/2} \equiv \pm 1 \bmod M. \tag{6.5} \]

The roots of Equation (6.4) can be analyzed case by case. For the case where x^{2^{m-1}}
≡ 1 mod M, x cannot be a primitive root; otherwise, the definition of a primitive root is
contradicted. Hence x ranges over all elements except the primitive roots; that is,
x ∈ R = { α^i | (i, 2^m) = 2^1, 2^2, ..., 2^{m-1} }. On the other hand, for
x^{2^{m-1}} ≡ -1 mod M ≡ 2^m mod M, x is a member of the primitive roots, which are
R = { α^i | (i, 2^m) = 1 }. This means Equation (6.5) can be factored, over the group, as

\[ x^{2^{m-1}} + 1 = \prod_{i=0}^{2^{m-1} - 1} ( x - \alpha^{2i+1} ). \tag{6.6} \]

Equation (6.6) states that in the case of N = 2^{m-1} the roots are primitive roots (of order
2^m) modulo M and are of the form α^{2i+1} for i = 0, 1, ..., N-1. For the case of N = 2^{m-2}, the
roots of the equation x^N + 1 ≡ 0 mod M are obtained by first finding a root α of order 2^{m-1}, and
subsequently raising it to the power 2i + 1 for i = 0, 1, ..., N-1. A similar procedure applies to
the subsequent cases where N = 2^{m-k} for 2 < k < m, k an integer.

According to the above discussion, the roots r_i of the equation x^N ≡ -1 mod M, where
N | 2^{m-1}, are solved and represented as follows:

\[
\begin{aligned}
r_0 &= 2^{m/N} \bmod M, \\
r_1 &= 2^{3m/N} \bmod M = r_0^{3}, \\
    &\ \ \vdots \\
r_i &= 2^{(2i+1)m/N} \bmod M = r_0^{2i+1},
\end{aligned} \tag{6.7}
\]

where i = 0, 1, ..., N-1. Furthermore, an additional relationship between roots is obtained
according to the following example:

\[
\begin{aligned}
r_{N-1} &= 2^{(2N-1)m/N} \bmod M \\
        &= (2^{m})^{2}\, 2^{-m/N} \bmod M \\
        &= (-1)^{2}\, 2^{-m/N} \bmod M \\
        &= r_0^{-1}.
\end{aligned} \tag{6.8}
\]

A further result indicates that r_{N-2} = r_0^{-3}. Consequently, a general form is obtained:

\[ r_{N-i} = r_0^{-(2i-1)}, \tag{6.9} \]

where i = 1, 2, ..., N.

Example 6.1: The roots of x^N ≡ -1 mod M.

Let N = 4 and M be a Fermat prime; then r_1 = r_0^3, r_2 = r_0^5, and r_3 = r_0^7. From Equation (6.9),
these roots are also rewritten as r_3 = r_0^{-1}, r_2 = r_0^{-3}, and r_1 = r_0^{-5}. ♦
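
The root relationships above are easy to confirm numerically. The following is a minimal sketch, assuming N divides m so that the exponent (2i+1)m/N is an integer; the function name `prns_roots` is illustrative. It generates the roots of Equation (6.7) with Python's built-in modular exponentiation.

```python
def prns_roots(m, n_points):
    """Return (roots, M): the roots r_i = 2^((2i+1)m/N) of x^N = -1 modulo M = 2^m + 1."""
    if m % n_points != 0:
        raise ValueError("this sketch assumes that N divides m")
    modulus = 2 ** m + 1
    step = m // n_points
    roots = [pow(2, (2 * i + 1) * step, modulus) for i in range(n_points)]
    return roots, modulus
```

For m = 8 and N = 4 (M = 257) this gives r_0 = 4, and each r_{N-i} multiplied by r_0^{2i-1} reduces to 1 modulo M, in agreement with Equation (6.9).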

6.1.3  The Multiplier-Free Fast FFT-Type Isomorphic Mappings

The computational complexity and hardware cost for the multiplication of two
polynomials is reduced to the cost of introducing the isomorphic mappings found in Chapter 4,
Equations (4.35) and (4.36). The simplification of these nontrivial mappings is the major task in
this section. Two methods are introduced, namely the FFT-type transform and multiplier-free
scaling.

The Fast FFT-Type Algorithm. The FFT-type algorithms are applied to the mappings
based on the facts that M - 1 is highly composite and that r^N ≡ -1 mod M,
where r ∈ Z_M. Furthermore, the relationship between the roots of Equation (6.6) is also
used.

Let the input and output variables of the isomorphic mapping be a_n and A_k,
respectively, where n, k = 0, 1, ..., N-1. Analogous to the Cooley-Tukey FFT algorithm, the
isomorphic mappings are derived as follows:

1)  Isomorphic Forward Transform

From Equation (4.35), the PRNS format of the input variable a_n is

\[ A_k = \sum_{n=0}^{N-1} a_n\, r_k^{\,n}. \tag{6.10} \]

Since Equation (6.7) implies that r_k = r_0^{2k+1}, Equation (6.10) becomes, writing r for r_0,

\[ A_k = \sum_{n=0}^{N-1} a_n\, r^{n(2k+1)}. \tag{6.11} \]

To form the FFT-type transform, let a_n be separated into its even- and odd-numbered points
as follows:

\[ A_k = \sum_{n\ \mathrm{even}} a_n\, r^{n(2k+1)} + \sum_{n\ \mathrm{odd}} a_n\, r^{n(2k+1)}. \tag{6.12} \]

If a substitution of variables n = 2d for n even and n = 2d + 1 for n odd is made, then Equation
(6.12) becomes

\[
\begin{aligned}
A_k &= \sum_{d=0}^{N/2-1} a_{2d}\, r^{2d(2k+1)} + \sum_{d=0}^{N/2-1} a_{2d+1}\, r^{(2d+1)(2k+1)} \\
    &= \sum_{d=0}^{N/2-1} a_{2d}\, (r^{2})^{d(2k+1)} + r^{2k+1} \sum_{d=0}^{N/2-1} a_{2d+1}\, (r^{2})^{d(2k+1)}.
\end{aligned} \tag{6.13}
\]

Since (r^2)^{N/2} ≡ -1 mod M, let s = r^2. Equation (6.13) can then be rewritten as

\[
\begin{aligned}
A_k &= \sum_{d=0}^{N/2-1} a_{2d}\, s^{d(2k+1)} + r^{2k+1} \sum_{d=0}^{N/2-1} a_{2d+1}\, s^{d(2k+1)} \\
    &= A'_k + r^{2k+1} A''_k.
\end{aligned} \tag{6.14}
\]

Each of the sums A'_k and A''_k in Equation (6.14) is recognized as an N/2-point transform.
After the two transforms are computed, they are combined to yield the N-point transform A_k.
This procedure continues until N is completely decomposed; hence, log_2 N iterations are
required to complete the transform.
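
The decomposition of Equation (6.14) maps directly onto a recursive routine. The following is a minimal sketch, assuming N is a power of two and r is a root with r^N ≡ -1 mod M; the names `forward_direct` and `forward_fft` are illustrative, and the generic modular multiplications used here stand in for the shift-then-add scalings developed later in this chapter.

```python
def forward_direct(a, r, modulus):
    """Evaluate A_k = sum_n a_n r^(n(2k+1)) mod M term by term, as in Equation (6.11)."""
    n_points = len(a)
    return [sum(a[n] * pow(r, n * (2 * k + 1), modulus) for n in range(n_points)) % modulus
            for k in range(n_points)]


def forward_fft(a, r, modulus):
    """Evaluate the same transform by the even/odd split of Equation (6.14)."""
    n_points = len(a)
    if n_points == 1:
        return [a[0] % modulus]
    s = r * r % modulus                        # s = r^2 is the root of the half-size problem
    even = forward_fft(a[0::2], s, modulus)    # A'_k  (N/2-point transform)
    odd = forward_fft(a[1::2], s, modulus)     # A''_k (N/2-point transform)
    half = n_points // 2
    # The half-size outputs repeat with period N/2 because s^N = 1.
    return [(even[k % half] + pow(r, 2 * k + 1, modulus) * odd[k % half]) % modulus
            for k in range(n_points)]
```

With M = 257 and r = 4, `forward_fft([1, 2, 3, 4], 4, 257)` agrees with the direct evaluation; the same holds for an 8-point input with M = 65537 and r = 4.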

2)  The Isomorphic Inverse Transform

From Equation (4.36), the regular format of the variable A_k is

\[ a_n = N^{-1} \sum_{k=0}^{N-1} A_k\, r_k^{-n}. \tag{6.15} \]

Since Equation (6.7) implies that r_k = r_0^{2k+1}, Equation (6.15) becomes, writing r for r_0,

\[ a_n = N^{-1} \sum_{k=0}^{N-1} A_k\, r^{-n(2k+1)}, \tag{6.16} \]

where N^{-1} is the multiplicative inverse of N and n = 0, 1, ..., N-1. Then A_k is separated into
its even- and odd-numbered points:

\[ a_n = N^{-1} \Big( \sum_{k\ \mathrm{even}} A_k\, r^{-n(2k+1)} + \sum_{k\ \mathrm{odd}} A_k\, r^{-n(2k+1)} \Big). \tag{6.17} \]

Similarly, if a substitution of variables k = 2e for k even and k = 2e + 1 for k odd is made,
Equation (6.17) becomes

\[
\begin{aligned}
a_n &= N^{-1} \Big( \sum_{e=0}^{N/2-1} A_{2e}\, r^{-n(4e+1)} + \sum_{e=0}^{N/2-1} A_{2e+1}\, r^{-n(4e+3)} \Big) \\
    &= N^{-1} \Big( r^{n} \sum_{e=0}^{N/2-1} A_{2e}\, (r^{2})^{-n(2e+1)} + r^{-n} \sum_{e=0}^{N/2-1} A_{2e+1}\, (r^{2})^{-n(2e+1)} \Big).
\end{aligned} \tag{6.18}
\]

Since (r^2)^{N/2} ≡ -1 mod M, let s = r^2. Equation (6.18) can then be rewritten as

\[
\begin{aligned}
a_n &= N^{-1} \Big( r^{n} \sum_{e=0}^{N/2-1} A_{2e}\, s^{-n(2e+1)} + r^{-n} \sum_{e=0}^{N/2-1} A_{2e+1}\, s^{-n(2e+1)} \Big) \\
    &= N^{-1} \big( r^{n} a'_n + r^{-n} a''_n \big).
\end{aligned} \tag{6.19}
\]

Similar formats are found in Equations (6.14) and (6.19), except that an extra scaling
operation (due to N^{-1}) is required at the end of the inverse transform. Instead of performing
the extra level of multiplication, a simple scaling technique merges N^{-1} into the final stage of
the transform. As a result, this extra level of latency is eliminated. A detailed
description of the technique is given in the following section.
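
The inverse split of Equation (6.19) can be exercised together with the forward mapping to multiply two polynomials modulo x^N + 1. The following is a minimal sketch, assuming M is a Fermat prime (so inverses follow from Fermat's theorem) and N is a power of two; all function names are illustrative. The sign flip for n ≥ N/2 reflects s^{N/2} ≡ -1, which makes the half-size sums antiperiodic in n.

```python
def forward(a, r, modulus):
    """Reference forward mapping: A_k = sum_n a_n r^(n(2k+1)) mod M."""
    n_points = len(a)
    return [sum(a[n] * pow(r, n * (2 * k + 1), modulus) for n in range(n_points)) % modulus
            for k in range(n_points)]


def inverse_unscaled(big_a, r, modulus):
    """Compute sum_k A_k r^(-n(2k+1)) mod M by the even/odd split of Equation (6.19)."""
    n_points = len(big_a)
    if n_points == 1:
        return [big_a[0] % modulus]
    s = r * r % modulus
    r_inv = pow(r, modulus - 2, modulus)             # valid because M is prime
    half = n_points // 2
    even = inverse_unscaled(big_a[0::2], s, modulus)  # a'_n
    odd = inverse_unscaled(big_a[1::2], s, modulus)   # a''_n
    out = []
    for n in range(n_points):
        sign = 1 if n < half else -1                  # s^(N/2) = -1 flips the sign
        term = pow(r, n, modulus) * even[n % half] + pow(r_inv, n, modulus) * odd[n % half]
        out.append(sign * term % modulus)
    return out


def inverse(big_a, r, modulus):
    """Apply the final N^{-1} scaling of Equation (6.19)."""
    n_inv = pow(len(big_a), modulus - 2, modulus)
    return [n_inv * x % modulus for x in inverse_unscaled(big_a, r, modulus)]


def negacyclic_product(a, b, r, modulus):
    """Multiply two polynomials modulo x^N + 1 and M through the isomorphic mappings."""
    fa, fb = forward(a, r, modulus), forward(b, r, modulus)
    return inverse([x * y % modulus for x, y in zip(fa, fb)], r, modulus)
```

For example, with M = 257 and r = 4, the product of 1 + 2x + 3x^2 + 4x^3 and 5 + 6x + 7x^2 + 8x^3 modulo x^4 + 1 has coefficients -56, -36, 2, and 60, which `negacyclic_product` returns reduced modulo 257.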

The Multiplier-Free Scaling Technique. The previous discussion shows that the
fundamental operations embedded in the FFT-type isomorphic mappings are the scaling
operations of the form r_i a_j and r_i A_j modulo M, where a_j and A_j are an isomorphic mapping pair
and r_i is a root of x^N ≡ -1 mod M. By virtue of the fact that all roots r_i are
binary powers (see Equation (6.7)), these scaling operations are easily implemented as
shift-then-adds instead of regular modular multiplications.

A number x ∈ Z_M is partitioned and denoted as an (m + 1)-bit word as

\[ x = x_H\, 2^{e} + x_L \;\leftrightarrow\; [\, x_H : x_L \,;\, e \,], \tag{6.20} \]

where x_H ∈ Z_t, t = ⌊(M - 1)/2^e⌋ + 1, and x_L ∈ Z_{2^e}. Hence, a ∈ Z_M is partitioned and is of the
form a = [ a_H : a_L ; (m - ((2i + 1)m)/N) ]. Consequently, r_i a is evaluated efficiently as

\[
\begin{aligned}
r_i\, a \bmod M &= 2^{\frac{(2i+1)m}{N}} \left( a_H\, 2^{\,m - \frac{(2i+1)m}{N}} + a_L \right) \bmod M \\
                &= \left( 2^{m} a_H + 2^{\frac{(2i+1)m}{N}} a_L \right) \bmod M \\
                &= \left( (2^{m} + 1)\, a_H - a_H + 2^{\frac{(2i+1)m}{N}} a_L \right) \bmod M \\
                &= \left( 2^{\frac{(2i+1)m}{N}} a_L - a_H \right) \bmod M.
\end{aligned} \tag{6.21}
\]

The result indicates a significant simplification of the product computation r_i a. This is
achieved by partitioning, then adding (shift-then-add).
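
Equation (6.21) can be checked against ordinary modular multiplication. The following is a minimal sketch, assuming a shift amount s with 0 ≤ s < m (larger exponents reduce to this case with a sign change because 2^m ≡ -1 mod M); the function name `scale_pow2` is illustrative.

```python
def scale_pow2(a, s, m):
    """Return (2^s * a) mod (2^m + 1) for 0 <= s < m by split, shift, and subtract."""
    modulus = (1 << m) + 1
    e = m - s                        # partition point e of Equation (6.20)
    a_high = a >> e                  # a_H
    a_low = a & ((1 << e) - 1)       # a_L
    result = (a_low << s) - a_high   # 2^s * a_L - a_H, as in Equation (6.21)
    # A single conditional correction suffices: the result lies in [-2^s, 2^m - 1].
    return result + modulus if result < 0 else result
```

For m = 8 and r = 2^2 = 4, `scale_pow2(a, 2, 8)` equals 4a mod 257 for every a in Z_257, using only one shift and one subtraction.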

The scaling factor N^{-1} of Equation (6.19) can be treated in a similar manner. First of all,
N^{-1} is evaluated by applying the definition of the inverse of N, which is N N^{-1} ≡ 1 mod M. Since
M = 2^m + 1 and N is a factor of M - 1, N = 2^{m-i} where i = 0, 1, ..., m - 1. Accordingly,

\[ 2^{m-i}\, N^{-1} \equiv 1 \bmod M. \tag{6.22} \]

Furthermore, since 1 ≡ -2^m mod M, Equation (6.22) indicates that N^{-1} ≡ -2^i. Thus, if a is
partitioned to the form a = [ a_H : a_L ; (m - i) ], then

\[ N^{-1} a \bmod M = \left( -2^{m} a_H - 2^{i} a_L \right) \bmod M = \left( a_H - 2^{i} a_L \right) \bmod M. \tag{6.23} \]
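
The same partitioning handles the N^{-1} scaling of Equation (6.23). The following is a minimal sketch, assuming N = 2^{m-i} with 0 ≤ i < m; the function name `scale_by_n_inverse` is illustrative.

```python
def scale_by_n_inverse(a, i, m):
    """Return (N^{-1} * a) mod (2^m + 1) for N = 2^(m-i), using N^{-1} = -2^i."""
    modulus = (1 << m) + 1
    e = m - i                        # partition point, a = [a_H : a_L ; m - i]
    a_high = a >> e                  # a_H
    a_low = a & ((1 << e) - 1)       # a_L
    result = a_high - (a_low << i)   # a_H - 2^i * a_L, as in Equation (6.23)
    # The result lies in [-(2^m - 2^i), 2^i], so one conditional correction suffices.
    return result + modulus if result < 0 else result
```

For m = 8 and N = 4 (so i = 6), the function reproduces multiplication by 4^{-1} ≡ 193 mod 257.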


6.2  A Pipeline  Third-order  Polynomial  RNS  Processor 


Considering acceptable data accuracy and low system design complexity,
let N = 4. Hence, multiplication of two complex numbers is performed as a multiplication
of two third-order polynomials modulo x^4 + 1. The congruence x^4 + 1 ≡ 0 mod M has
four distinct solutions (r, r^3, r^5, r^7) in Z_M. By applying FFT-type transforms and the
congruence property of the roots due to the fact that r^4 ≡ -1 mod M, the second stage of the
isomorphic forward transform A of a is

\[
\begin{aligned}
A_0 &= A'_0 + r A''_0; \\
A_1 &= A'_1 + r^{3} A''_1; \\
A_2 &= A'_2 + r^{5} A''_2 = A'_2 + (-r) A''_2; \\
A_3 &= A'_3 + r^{7} A''_3 = A'_3 + (-r^{3}) A''_3.
\end{aligned} \tag{6.24}
\]

A similar procedure, along with the fact that r^6 ≡ -r^2 mod M, is applied to the evaluation of
the first stage. It turns out that the terms A'_i, A''_i in the first stage, where i = 0, 1, 2, 3, can
be redefined as follows:

\[ A'_0 = a_0 + r^{2} a_2, \qquad A''_0 = a_1 + r^{2} a_3; \tag{6.25} \]

\[ A'_1 = a_0 + r^{6} a_2 = a_0 - r^{2} a_2, \qquad A''_1 = a_1 + r^{6} a_3 = a_1 - r^{2} a_3; \tag{6.26} \]

\[ A'_2 = a_0 + r^{2} a_2 = A'_0, \qquad A''_2 = a_1 + r^{2} a_3 = A''_0; \]

and

\[ A'_3 = a_0 + r^{6} a_2 = a_0 - r^{2} a_2 = A'_1, \qquad A''_3 = a_1 + r^{6} a_3 = a_1 - r^{2} a_3 = A''_1. \]

A two-stage pipeline forward transform is shown in the block diagram of
Figure 6.1.
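
The two pipeline stages of Equations (6.24) through (6.26) can be written out as straight-line butterflies. The following is a minimal sketch, assuming m = 8, so that M = 257 and r = 2^{m/N} = 4; the function names are illustrative, and ordinary modular multiplications stand in for the shift-then-add scalers.

```python
MODULUS = 257   # M = 2^8 + 1
ROOT = 4        # r = 2^(m/N) with m = 8, N = 4


def forward_f4(a):
    """Two-stage forward mapping F4 for inputs (a0, a1, a2, a3)."""
    a0, a1, a2, a3 = a
    r = ROOT
    r2 = r * r % MODULUS
    r3 = r2 * r % MODULUS
    # Stage 1, Equations (6.25)-(6.26): only two distinct pairs are needed,
    # because A'_2 = A'_0, A''_2 = A''_0, A'_3 = A'_1, and A''_3 = A''_1.
    p0 = (a0 + r2 * a2) % MODULUS   # A'_0
    q0 = (a1 + r2 * a3) % MODULUS   # A''_0
    p1 = (a0 - r2 * a2) % MODULUS   # A'_1
    q1 = (a1 - r2 * a3) % MODULUS   # A''_1
    # Stage 2, Equation (6.24).
    return [(p0 + r * q0) % MODULUS,
            (p1 + r3 * q1) % MODULUS,
            (p0 - r * q0) % MODULUS,
            (p1 - r3 * q1) % MODULUS]


def forward_reference(a):
    """Direct evaluation A_k = sum_n a_n r^(n(2k+1)) mod M, for comparison."""
    return [sum(a[n] * pow(ROOT, n * (2 * k + 1), MODULUS) for n in range(4)) % MODULUS
            for k in range(4)]
```

Both functions return the same four residues for any input, for example for `[1, 2, 3, 4]`.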

Figure 6.1  The Isomorphic Forward Transform F_4

With a similar procedure, along with the fact that (r^4)^{-1} ≡ -1 mod M and the result in
Example 6.1, the inverse transform is expressed as

\[
\begin{aligned}
a_0 &= N^{-1} ( A_0 + A_2 + A_1 + A_3 ); \\
a_1 &= N^{-1} \big( r^{-1} A_0 + r^{-5} A_2 + r^{-2} ( r^{-1} A_1 + r^{-5} A_3 ) \big) \\
    &= N^{-1} \big( r^{-1} ( A_0 - A_2 ) + r^{-3} ( A_1 - A_3 ) \big) \\
    &= N^{-1} \big( -r^{3} ( A_0 - A_2 ) + (-r)( A_1 - A_3 ) \big); \\
a_2 &= N^{-1} \big( r^{-2} A_0 + r^{-10} A_2 + r^{-4} ( r^{-2} A_1 + r^{-10} A_3 ) \big) \\
    &= N^{-1} \big( -r^{2} ( A_0 + A_2 ) + r^{2} ( A_1 + A_3 ) \big); \\
a_3 &= N^{-1} \big( A_0 r^{-3} + A_2 r^{-15} + r^{-6} ( A_1 r^{-3} + A_3 r^{-15} ) \big) \\
    &= N^{-1} \big( -r ( A_0 - A_2 ) + (-r^{3})( A_1 - A_3 ) \big).
\end{aligned} \tag{6.27}
\]

Because of the extra scaling by N^{-1}, a three-stage pipeline inverse transform results, as
shown in the block diagram of Figure 6.2.


Figure 6.2  The Isomorphic Inverse Transform F_4^{-1}


Using the multiplier-free scaling techniques discussed in the previous section, the scaling
factors ±r, ±r^2, and ±r^3 are derived as follows:

1)  Scaling by r and -r: Let

\[ a = [\, a_H : a_L \,;\, (3/4)m \,] \]

such that

\[ r a \bmod M = 2^{m/4} a_L - a_H \tag{6.28} \]

and

\[ -r a \bmod M = -2^{m/4} a_L + a_H. \tag{6.29} \]

2)  Scaling by r^2 and -r^2: Let

\[ a = [\, a_H : a_L \,;\, (1/2)m \,] \]

such that

\[ r^{2} a \bmod M = 2^{m/2} a_L - a_H \tag{6.30} \]

and

\[ -r^{2} a \bmod M = -2^{m/2} a_L + a_H. \tag{6.31} \]

3)  Scaling by r^3 and -r^3: Let

\[ a = [\, a_H : a_L \,;\, (1/4)m \,] \]

such that

\[ r^{3} a \bmod M = 2^{3m/4} a_L - a_H \tag{6.32} \]

and

\[ -r^{3} a \bmod M = -2^{3m/4} a_L + a_H. \tag{6.33} \]

A pictorial representation of these scaling modules is shown in Figure 6.3.

Figure 6.3  The Isomorphic Mapping Scalar Modules

A nine-bit PRNS arithmetic prototype is implemented based upon the theoretical
development, where N = 4 and M = 2^8 + 1; hence, N^{-1} = -2^{m-2} = -2^6 = -r^3. As a result, the
three-stage pipeline in the inverse transform can be reduced to two stages by the following
procedure. Referring to Equation (6.27), where N^{-1} is replaced by -r^3, the transformed
variables a_i become

\[
\begin{aligned}
a_0 &= -r^{3} ( A_0 + A_2 + A_1 + A_3 ); \\
a_1 &= -r^{3} \big( r^{-1} A_0 + r^{-5} A_2 + r^{-2} ( r^{-1} A_1 + r^{-5} A_3 ) \big) \\
    &= -r^{2} ( A_0 + r^{-4} A_2 ) + ( -A_1 - r^{-4} A_3 ) \\
    &= -r^{2} ( A_0 - A_2 ) + ( A_3 - A_1 ); \\
a_2 &= -r^{3} \big( r^{-2} A_0 + r^{-10} A_2 + r^{-4} ( r^{-2} A_1 + r^{-10} A_3 ) \big) \\
    &= -r ( A_0 + A_2 ) + r ( A_1 + A_3 ); \\
a_3 &= -r^{3} \big( A_0 r^{-3} + A_2 r^{-15} + r^{-6} ( A_1 r^{-3} + A_3 r^{-15} ) \big) \\
    &= ( A_2 - A_0 ) + (-r^{2})( A_1 - A_3 ).
\end{aligned} \tag{6.34}
\]
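
The merged form of Equation (6.34) can be verified by a round trip through the forward mapping. The following is a minimal sketch, assuming M = 257 and r = 4; the function names are illustrative, and the grouping into sums and differences is one plausible reading of the two-stage structure rather than the exact wiring of the block diagram.

```python
MODULUS = 257   # M = 2^8 + 1
ROOT = 4        # r = 2^(m/N) with m = 8, N = 4


def forward_reference(a):
    """Direct forward mapping A_k = sum_n a_n r^(n(2k+1)) mod M."""
    return [sum(a[n] * pow(ROOT, n * (2 * k + 1), MODULUS) for n in range(4)) % MODULUS
            for k in range(4)]


def inverse_f4(big_a):
    """Inverse mapping with N^{-1} = -r^3 folded into the scalings, per Equation (6.34)."""
    a0_, a1_, a2_, a3_ = big_a
    r = ROOT
    r2 = r * r % MODULUS
    r3 = r2 * r % MODULUS
    # Stage 1: sums and differences of the transformed inputs.
    s02, d02 = a0_ + a2_, a0_ - a2_
    s13, d13 = a1_ + a3_, a1_ - a3_
    # Stage 2: scalings by -r^3, -r^2, and r, followed by the final additions.
    return [(-r3 * (s02 + s13)) % MODULUS,
            (-r2 * d02 - d13) % MODULUS,
            (-r * s02 + r * s13) % MODULUS,
            (-d02 - r2 * d13) % MODULUS]
```

Applying `inverse_f4` to `forward_reference(a)` returns the original residues `a`.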


A two-stage pipeline inverse transform is shown in the block diagram of
Figure 6.4.

Figure 6.4  The Isomorphic Inverse Transform

A two-stage pipeline 9-bit third-order PRNS system is developed and shown in Figure
6.5, where the processing module AMmodM_x is a modular adder and multiplication unit.


6.3  System  Implementation  of  the  Third-order  Polynomial  RNS  Processor 

According to the theoretical background developed in the previous section, a third-order
polynomial RNS processor is implemented using various design technologies. A two-level
hierarchical implementation approach is taken to ensure the effectiveness and accuracy of the
design. The top level of the design flow is a functional verification which includes logic
simulation and unit-delay timing simulation. The lower level of the processor design is a
topological layout of transistors, which is a physical implementation of the design. A trade-off
between design load and adequacy of test vector generation leads to the choice of m =
8, which makes M = 257. As a result, a 9-bit system is built using the HP Design Capture
System (HP-DCS), which provides schematic capture and logic/timing simulations. For the
low-level design, a λ-based CMOS design is accomplished using the Magic layout tool, which
runs on a SUN SPARCstation. Since this level of design is for the proof of theory, a 5-bit
system is built which contains a total of 13,000 transistors on a 6.0 by 6.1 square millimeter
footprint. Detailed descriptions of the design, implementation, test, and system evaluation
of the third-order polynomial residue number processor are presented.

Figure 6.5  System Block of the 9-bit Third-order PRNS Arithmetic Unit with FFT-
Type Fast Isomorphic Mappings

6.3.1  A Nine-bit  PRNS  Processor  Development  on  the  HP-DCS 

System Design and Schematic Capture. Applying a top-down hierarchical design approach, a
top-level system schematic is captured and depicted in Figure 6.6. It consists of the forward
and inverse transforms (the forwMAP's and a backMAP) and two processing cores. These
cores are a four-cell modular multiplier bank (the mul9modp's) and a four-cell modular adder
bank (the add9MODp's). A microcode control bit (selma) sets the multiplexer arrays to select
either the multiplication or the addition operation. A hierarchical tree graph of the system is
shown in Figure 6.7. The HP-DCS supports a broad device library which includes CMOS,
TTL, and analog cells. In terms of function, there are drivers and buffers, flip-flops, decoders,
arithmetic units, counters, latches and triggers, multiplexers, transceivers, and basic logic
gates.

An m-bit third-order polynomial RNS processor capable of performing polynomial arithmetic
is developed based upon the theories discussed in the previous sections. The detailed
schematics of the logic blocks of the system are presented in the following sections, which
cover the Modular Negator, the Modular Adder, the Modular Multiplier, and the Isomorphic
Forward/Inverse Mapping Units. A sequence of logic and timing diagrams is included to
demonstrate the design procedure and to verify the system performance.

1)  Modular  Negator 




Figure  6.6  Top  Level  Schematic  of  the  PRNS  System 


The purpose of this module is to convert an input x into -x, so that it can be used in
subtraction. The mapping equation is given by

$-x = (M - |x|) \bmod M$    (6.35)

If $M = 2^m + 1$ and $x \in Z_M$, then

$-x = (2^m + 1 - x) \bmod M$
$   = (2^m + 1 + \bar{x} + 1) \bmod M$
$   = (2^m + 2 + \bar{x}) \bmod M$

where $\bar{x}$ denotes the bitwise complement (1's complement) of x. From Equation (6.35),
one can find the relations between the input x and the output -x. Given




Figure  6.7  The  Hierarchical  Design  of  the  Third-Order  Polynomial  RNS  Processor  on  HP  DCS 




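$x = a_m 2^m + a_{m-1} 2^{m-1} + \ldots + a_1 2 + a_0$
$-x = b_m 2^m + b_{m-1} 2^{m-1} + \ldots + b_1 2 + b_0$

then

$b_0 = \bar{a}_0$
$b_1 = a_1$
$b_2 = \bar{a}_1 a_2 + a_1 \bar{a}_2$
$b_3 = \bar{a}_1 \bar{a}_2 a_3 + (a_1 + a_2) \bar{a}_3$
$\vdots$
$b_m = \bar{a}_1 \bar{a}_2 \ldots \bar{a}_m + (a_1 + a_2 + \ldots + a_{m-1}) a_m$    (6.36)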

For the nine-bit system, where m = 8, a commercial Programmable Array Logic (PAL) device
is used. The HP-DCS provides automatic PAL implementation through the Programmable
Logic Device Design System (PLDDS) software package. The Foreign Tool Interface (FTI)
of the PLDDS performs automatic logic synthesis and device selection according to a
user-defined Boolean equation file. The final PLD design can then be transferred to a regular
DCS design. The file containing the Boolean equations for the modular negator is shown in
Figure 6.8. A multiwindow picture showing the FTI device selection is given in Figure 6.9.
For a detailed description of the PLD design procedures, see the HP PLDDS User Interface
Basics reference.

In order to simplify interpretation and reduce the computational delay of negation, an input
of x = 0 should be kept at zero rather than mapped to M. Thus, a zero-detect circuit is required
and is embedded in the design.
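As an illustration only, the negation rule and the bit equations in the style of Equation (6.36) can be checked with a short Python sketch. It is not part of the HP-DCS design: the function names `neg_mod_fermat` and `neg_bits` are hypothetical, and truncating the sum to m+1 bits is an assumption about how the word width is handled.

```python
def neg_mod_fermat(x: int, m: int = 8) -> int:
    """Return -x modulo M = 2**m + 1, with the input 0 kept at 0."""
    M = (1 << m) + 1
    if not 0 <= x < M:
        raise ValueError("x must lie in Z_M")
    if x == 0:                       # zero-detect path: 0 stays 0
        return 0
    mask = (1 << (m + 1)) - 1        # operands are (m+1)-bit words
    x_bar = ~x & mask                # bitwise (1's) complement of x
    # 2**m + 2 + x_bar, truncated to m+1 bits (assumed word width)
    return ((1 << m) + 2 + x_bar) & mask


def neg_bits(x: int, m: int = 8) -> int:
    """Same negation, evaluated bit by bit in the style of Equation (6.36)."""
    a = [(x >> i) & 1 for i in range(m + 1)]
    nonzero = int(any(a))            # zero-detect signal
    b = [0] * (m + 1)
    b[0] = (1 - a[0]) & nonzero
    b[1] = a[1] & nonzero
    for k in range(2, m + 1):
        lower_any = int(any(a[1:k]))     # a_1 + ... + a_{k-1}
        lower_none = 1 - lower_any       # product of their complements
        if k < m:
            bit = (lower_none & a[k]) | (lower_any & (1 - a[k]))
        else:                            # most significant bit b_m
            bit = (lower_none & (1 - a[k])) | (lower_any & a[k])
        b[k] = bit & nonzero
    return sum(bit << i for i, bit in enumerate(b))
```

Both functions can be compared against `(M - x) % M` over all of $Z_M$, for example with m = 8 (M = 257) or m = 4 (M = 17).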

2)  Modular  Adder 

A modulo M adder adds two integer inputs in $Z_M$ and maps their sum to a modulo M
sum. The same unit can perform modulo M subtraction by using a negator to negate one of
the operands. A modulo M adder, shown in Figure 6.10, is actually composed of


/* FTI Design for add8modp - negmodp, using PAL22V10 */

dummy main (xi0, xi1, xi2, xi3, xi4, xi5, xi6, xi7, xi8, xo0, xo1, xo2, xo3, xo4,
            xo5, xo6, xo7, xo8)

input xi0, xi1, xi2, xi3, xi4, xi5, xi6, xi7, xi8;
output xo0, xo1, xo2, xo3, xo4, xo5, xo6, xo7, xo8;

{

node zero0, xo8n;

/* zero0 is 1 for a nonzero input; it forces the output to 0 when the input is 0 */
zero0 = xi0 | xi1 | xi2 | xi3 | xi4 | xi5 | xi6 | xi7 | xi8;
xo0 = !xi0 & zero0;

xo1 = xi1 & zero0;

xo2 = (!xi1 & xi2 | xi1 & !xi2) & zero0;

xo3 = (!xi1 & !xi2 & xi3 | (xi1 | xi2) & !xi3) & zero0;

xo4 = (!xi1 & !xi2 & !xi3 & xi4 | (xi1 | xi2 | xi3) & !xi4) & zero0;

xo5 = (!xi1 & !xi2 & !xi3 & !xi4 & xi5 | (xi1 | xi2 | xi3 | xi4) & !xi5)
      & zero0;

xo6 = (!xi1 & !xi2 & !xi3 & !xi4 & !xi5 & xi6 | (xi1 | xi2 | xi3 | xi4 |
      xi5) & !xi6) & zero0;

xo7 = (!xi1 & !xi2 & !xi3 & !xi4 & !xi5 & !xi6 & xi7 | (xi1 | xi2 |
      xi3 | xi4 | xi5 | xi6) & !xi7) & zero0;
xo8n = (!xi1 & !xi2 & !xi3 & !xi4 & !xi5 & !xi6 & !xi7 & xi8 | (xi1 |
       xi2 | xi3 | xi4 | xi5 | xi6 | xi7) & !xi8);
xo8 = !xo8n & zero0;

}


Figure  6.8  The  Boolean  Equation  of  the  Module  - negmodp 


three parts: an m-bit fast carry-lookahead adder, an m-bit table-lookup modulo unit, and a
logic control circuit.

a) The m-bit Fast Carry-Lookahead Adder

It can be implemented with a conventional full carry-lookahead adder with an overflow flag
(see Figure 6.10). The combination of slice-type full adders with carry-generate and
carry-propagate outputs and a carry-lookahead generator achieves the fast operating goal.

b) The Table Mapping Modulo Unit ($MDL_M$)




Figure  6.9  The  Multiwindow  Demonstration  of  FTI  Device  Selection 


The modulo operation $MDL_M$ is implemented by a table mapping modulo unit, which uses a
programmable logic device to implement a bit-parallel mapping of the output of the binary
adder (see Figure 6.11). For the case that

$S = A + B \geq M$

where $M = 2^m + 1$, then

$(A + B) \bmod M = S - M.$

The procedure used to realize the unit is


Figure 6.10 The Modulo M Adder


i) Starting with the least significant bit (LSB) of S, complement all '0's up to the first '1'.

ii) Complement the first encountered '1' in S.

iii) Leave all other bits of S unaltered and set the (m+1)st bit to '0'.

The Boolean equations derived from the algorithm above are

$d_0 = \bar{s}_0$
$d_1 = \bar{s}_0 \bar{s}_1 + s_0 s_1$
$d_2 = \bar{s}_0 \bar{s}_1 \bar{s}_2 + (s_0 + s_1) s_2$
$\vdots$
$d_{m-1} = \bar{s}_0 \bar{s}_1 \bar{s}_2 \ldots \bar{s}_{m-1} + (s_0 + s_1 + \ldots + s_{m-2}) s_{m-1}$    (6.37)

For the case of m = 8, a similar design, summodp, is implemented with the HP-DCS software.
The Boolean equation file is shown in Figure 6.12.
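The three steps above amount to decrementing the low m bits of S and clearing bit m. A minimal Python sketch, offered as an illustration and not as the PAL implementation, is given below; the function names `mdl_procedure` and `mdl_equations` are hypothetical.

```python
def mdl_procedure(s: int, m: int = 8) -> int:
    """Apply steps i)-iii) to the low m bits of S (bit m is cleared)."""
    out = 0
    seen_one = False
    for i in range(m):                    # scan from the LSB upward
        bit = (s >> i) & 1
        if seen_one:
            new = bit                     # step iii: remaining bits unchanged
        elif bit == 0:
            new = 1                       # step i: complement the leading 0s
        else:
            new = 0                       # step ii: complement the first 1
            seen_one = True
        out |= new << i
    return out


def mdl_equations(s: int, m: int = 8) -> int:
    """Evaluate d_0 ... d_{m-1} in the form of Equation (6.37)."""
    bits = [(s >> i) & 1 for i in range(m)]
    out = 0
    for k in range(m):
        lower_any = int(any(bits[:k]))            # s_0 + ... + s_{k-1}
        all_zero = int(not any(bits[:k + 1]))     # s_0 ... s_k are all 0
        d_k = all_zero | (lower_any & bits[k])
        out |= d_k << k
    return out
```

For every sum S = A + B with A, B in $Z_M$ and S at least M, both functions are expected to return S - M; this can be checked exhaustively for m = 8.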


Figure 6.11 The $MDL_M$ Unit Implemented Using a Table Lookup Method

c) Control Logic (Compare Network)

It is simply a combinational logic circuit which decides whether the output of the module is
to be the moduloed or the unmoduloed one. The condition for applying the modulo is

$S = A + B \geq M$

which means

{ $s_m = 1$ and ($s_{m-1}$ or $s_{m-2}$ or ... or $s_0$) ≠ 0 } or { mod = 1 }.

An HP-DCS designed PAL is generated from the Boolean equation file shown in Figure 6.13.
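The behaviour of the complete adder (binary adder, compare network, and table-mapped output) can be sketched in Python as follows. This is a behavioural illustration under stated assumptions, not the gate-level design: the name `add_mod_fermat` is hypothetical, and the `mod` flag is assumed here to be the carry out of bit m of the binary sum.

```python
def add_mod_fermat(a: int, b: int, m: int = 8) -> int:
    """Behavioural sketch of a modulo (2**m + 1) adder."""
    M = (1 << m) + 1
    if not (0 <= a < M and 0 <= b < M):
        raise ValueError("operands must lie in Z_M")
    s = a + b                              # binary adder output
    low = s & ((1 << m) - 1)               # bits s_{m-1} ... s_0
    s_m = (s >> m) & 1                     # bit s_m
    mod_flag = (s >> (m + 1)) & 1          # assumed meaning of 'mod'
    # Compare network: select the moduloed output when S >= M.
    if (s_m == 1 and low != 0) or mod_flag == 1:
        # Table-mapped value: low m bits decremented, bit m cleared.
        return (low - 1) & ((1 << m) - 1)
    return s                               # unmoduloed output
```

A quick check is to compare `add_mod_fermat(a, b)` with `(a + b) % 257` for all operand pairs.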

Overall, a nine-bit modular adder is implemented in the HP-DCS environment, as shown in
Figure 6.14. The usefulness of the programmable logic device lies in its ability to combine a
number of distributed logic gates (the "glue" logic) into a single device. By reprogramming
the device, the functionality of the gates is modified to suit specific needs. The




/* FTI Design for add8modp - summodp, using PAL16HD8 */

dummy main (xi0, xi1, xi2, xi3, xi4, xi5, xi6, xi7, xo0, xo1, xo2, xo3, xo4, xo5,
            xo6, xo7)

input xi0, xi1, xi2, xi3, xi4, xi5, xi6, xi7;
output xo0, xo1, xo2, xo3, xo4, xo5, xo6, xo7;

{

/* each output bit follows Equation (6.37): decrement of the low eight sum bits */
xo0 = !xi0;

xo1 = !xi0 & !xi1 | xi0 & xi1;
xo2 = !xi0 & !xi1 & !xi2 | (xi0 | xi1) & xi2;

xo3 = !xi0 & !xi1 & !xi2 & !xi3 | (xi0 | xi1 | xi2) & xi3;

xo4 = !xi0 & !xi1 & !xi2 & !xi3 & !xi4 | (xi0 | xi1 | xi2 | xi3) & xi4;

xo5 = !xi0 & !xi1 & !xi2 & !xi3 & !xi4 & !xi5 | (xi0 | xi1 | xi2 | xi3 |
      xi4) & xi5;

xo6 = !xi0 & !xi1 & !xi2 & !xi3 & !xi4 & !xi5 & !xi6 | (xi0 | xi1 |
      xi2 | xi3 | xi4 | xi5) & xi6;

xo7 = !xi0 & !xi1 & !xi2 & !xi3 & !xi4 & !xi5 & !xi6 & !xi7 | (xi0 |
      xi1 | xi2 | xi3 | xi4 | xi5 | xi6) & xi7;

}

*Note: & = AND; | = OR; ^ = XOR; ! = INV.


Figure  6.12  The  Boolean  Equation  of  the  Module  - summodp 

modular adder consists of two TTL four-bit adders (74283's), two PALs (the summodp, which
is the table mapping modulo unit whose Boolean equation file is shown in Figure 6.12, and
the sumovck, which performs the compare network task and whose Boolean equation file is
shown in Figure 6.13), and a nine-bit multiplexer. The multiplexer module is made up of two
four-bit multiplexers (74157's) and gates which include an inverter (7404), two-input AND
gates (7400's), and a two-input OR gate (7432). The detailed circuitry can be found in the
reference.

3)  Modular  Multiplier 

The  single  modulus  RNS  multiplier  is  a critical  element  to  the  successful  realization  of 
the  PRNS  processor.  Using  multiplier  devices  which  are  available  on  the  market,  the  Modulo 




/* FTI Design for add8modp - sum overflow checking */

/* Module sumovck using PAL12L6 */

dummy main (a8, b8, adc, d0, d1, d2, d3, d4, d5, d6, d7, muxs, o8)

input a8, b8, adc, d0, d1, d2, d3, d4, d5, d6, d7;
output muxs, o8;

{

node a, b, c, fac, fas;

/* muxs selects the moduloed output of the adder */
muxs = or(fac, a);

a = and(b, c);

b = or(fas, adc);

c = d0 | d1 | d2 | d3 | d4 | d5 | d6 | d7;
fac = a8 & b8;
fas = a8 ^ b8;

o8 = or(fas, adc);

}

*Note: & = AND; | = OR; ^ = XOR; ! = INV.


Figure  6.13  The  Boolean  Equation  of  the  Module  - sumovck 


M Multiplier can be realized in a very straightforward manner. There are some special cases
which should be taken into account separately (see Figure 6.15).

i) If both inputs A and B are equal to $2^m$, then C (the result) equals 1, since
$(2^m \cdot 2^m) \bmod M = 1$ for $M = 2^m + 1$. The detect circuit enables the output buffer to
set the result to 1.

ii) If only one of the inputs equals $2^m$ (for example, $A = 2^m$), then

$(A \cdot B) \bmod M = (2^m B) \bmod M$
$= ((2^m + 1) B - B) \bmod M$
$= -B.$

The detect circuit enables the output buffer to set the result to the negated B.




Figure  6.14  A Nine-Bit  Modular  Adder 




Figure 6.15 The Modulo M Multiplier


These special cases are handled by a shortcut through the negators and some elementary
gates, instead of going through the lengthy data path of all processing units.

If the (MSB-1)th bit is 1, this indicates that the operand is negative (except for the
$2^{m-1}$ case) and it shall be negated before being input to the multiplier. The unsigned
product shall then be moduloed and negated if required. For the modulo operation, the
unsigned multiplier can be configured using the following split-field scheme. Consider the
full-precision unsigned product $M_u$ (< 2m bits), partitioned as

$M_u = 2^m P_H + P_L$

where $P_H$ and $P_L$ are the high part and low part of the intermediate product $M_u$. This
yields a modulo M value $M_d$

$M_d = M_u \bmod (2^m + 1) = P_L - P_H.$

The final result will be $(P_L - P_H)$ or the negation of $(P_L - P_H)$, depending on the polarity
of the inputs. From the scheme shown in Figure 6.15, the longest propagation delay is
associated with the unsigned multiplication of positive and negative operands. The delay is
essentially that of three negators, one modulo M adder, some multiplexers, and an unsigned
multiplication. Note that the unsigned eight-by-eight multiplier (mul8x8) is implemented with
four four-bit multipliers (74274's) and a couple of four-bit adders. The detailed circuitry is
shown in Appendix B.1.
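The special cases, the sign handling, and the split-field reduction described above can be put together in a short behavioural sketch. It is an illustration only: the name `mul_mod_fermat` is hypothetical, and treating every operand greater than $2^{m-1}$ as "negative" is an assumption about how the (MSB-1)th-bit rule is applied.

```python
def mul_mod_fermat(a: int, b: int, m: int = 8) -> int:
    """Behavioural sketch of a modulo (2**m + 1) multiplier."""
    M = (1 << m) + 1
    top = 1 << m                      # 2**m, congruent to -1 modulo M
    half = 1 << (m - 1)
    if not (0 <= a < M and 0 <= b < M):
        raise ValueError("operands must lie in Z_M")
    # Special cases handled by the detect circuit.
    if a == top and b == top:
        return 1
    if a == top:
        return (M - b) % M            # result is the negated B
    if b == top:
        return (M - a) % M            # result is the negated A
    # Operands above 2**(m-1) are negated first (assumed sign rule).
    negate = False
    if a > half:
        a, negate = M - a, not negate
    if b > half:
        b, negate = M - b, not negate
    p = a * b                         # unsigned product M_u
    p_low = p & (top - 1)             # P_L: low m bits
    p_high = p >> m                   # P_H: remaining high bits
    m_d = (p_low - p_high) % M        # M_d = P_L - P_H (mod M)
    return (M - m_d) % M if negate else m_d
```

Comparing `mul_mod_fermat(a, b)` with `(a * b) % 257` over all operand pairs is a simple way to exercise every path of the sketch.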

4) Isomorphic Forward/Inverse Mapping Units

These units are designed to perform the isomorphic mappings to and from the polynomial
domain (or the algebraic integer domain for the complex number approximation). As discussed
in the previous sections, the mappings are expedited using FFT-type algorithms. However,
the fundamental operations still require scalings by powers of the root r of the equation
$x^N + 1 = 0$. These scalar modules, as shown in Figure 6.3, include ±r, ±r², and ±r³. With a
special treatment, multiplier-free scalings are achieved by reducing them to simple additions
or subtractions. A sample of one of the scalar modules, r², designed in the HP-DCS package is
depicted in Figure 6.16. The other scalar modules can be found in the Appendix.
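To illustrate why such scalings need no multiplier, the sketch below assumes that the root is the power of two $r = 2^{m/4}$, which satisfies $r^4 = 2^m \equiv -1 \pmod{2^m + 1}$ (for m = 8, r = 4 and M = 257). Scaling then becomes a left shift followed by the split-field subtraction. The function names `scale_pow2` and `scale_by_root` are hypothetical.

```python
def scale_pow2(x: int, k: int, m: int = 8) -> int:
    """Return (x * 2**k) mod (2**m + 1) using shift, split, and subtract."""
    M = (1 << m) + 1
    shifted = x << k                      # multiplication by 2**k is a shift
    low = shifted & ((1 << m) - 1)        # low m bits
    high = shifted >> m                   # bits above position m-1
    return (low - high) % M               # 2**m is congruent to -1 modulo M


def scale_by_root(x: int, j: int, m: int = 8, negative: bool = False) -> int:
    """Scale x by r**j (or by -r**j), assuming r = 2**(m/4)."""
    M = (1 << m) + 1
    y = scale_pow2(x, j * (m // 4), m)
    return (M - y) % M if negative else y
```

For m = 8, `scale_by_root(x, 2)` corresponds to the scaling by r² and `scale_by_root(x, 3, negative=True)` to the scaling by -r³.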

Using these scaling elements, the isomorphic mappings are implemented in a two-stage
pipeline structure based on the FFT-type algorithm. The forward mapping and the inverse
mapping are shown in Figure 6.17 and Figure 6.18, respectively. The latch element, lat4b,
which stores the intermediate results of the pipeline, is composed of a series of TTL latch
devices (74374's). Its detailed circuitry can also be found in Appendix B.1.
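The effect of the forward and inverse mappings can be illustrated with a direct (non-pipelined) Python sketch for M = 257. It evaluates a third-order polynomial at the four roots of $x^4 + 1$ modulo M, multiplies the residues componentwise, and maps the result back. This is a plain $O(N^2)$ evaluation and not the FFT-type two-stage structure; the root r = 4, the ordering of the roots, and all function names are assumptions made for the illustration.

```python
M = 257                    # modulus 2**8 + 1 (prime)
N = 4                      # third-order polynomials: four coefficients
R = 4                      # assumed root: 4**4 = 256 = -1 (mod 257)
ROOTS = [pow(R, 2 * i + 1, M) for i in range(N)]   # r, r**3, r**5, r**7


def forward_map(a):
    """Evaluate the polynomial with coefficients a at each root."""
    return [sum(c * pow(root, k, M) for k, c in enumerate(a)) % M
            for root in ROOTS]


def inverse_map(A):
    """Recover the coefficients from the residues A."""
    n_inv = pow(N, M - 2, M)                        # N**-1, since M is prime
    inv_roots = [pow(root, M - 2, M) for root in ROOTS]
    return [n_inv * sum(Ai * pow(ri, k, M) for Ai, ri in zip(A, inv_roots)) % M
            for k in range(N)]


def prns_multiply(a, b):
    """Polynomial product modulo (x**4 + 1, M) via componentwise products."""
    C = [(x * y) % M for x, y in zip(forward_map(a), forward_map(b))]
    return inverse_map(C)


def negacyclic_multiply(a, b):
    """Reference product modulo x**4 + 1, computed coefficient by coefficient."""
    c = [0] * N
    for i in range(N):
        for j in range(N):
            if i + j < N:
                c[i + j] += a[i] * b[j]
            else:
                c[i + j - N] -= a[i] * b[j]         # x**4 is replaced by -1
    return [v % M for v in c]
```

For any two coefficient lists, `prns_multiply` is expected to agree with `negacyclic_multiply`, and `inverse_map(forward_map(a))` is expected to return `a`.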

Functional  Verification  and  Timing  Analysis.  The  most  important  and  crucial  task  of 
the  system  development  is  to  verify  the  functionality  and  the  performance  of  the  design.  The 


Figure 6.16 The Scalar Module RSQ


HP-DCS supports both logic and timing simulation through the DVI software package (DVI
stands for HP Design Verification Interface for HILO-3), which simulates circuits created
by the HP-DCS.

The basics of the DVI include: a) Waveform Display, which shows how to use simulation
features to examine waveforms; b) Simulation Errors, which covers simulation failure,
debugging, and re-simulation; c) Min Time Simulation, which deals with minimum time delay
simulation parameters and how to compare simulations with differing parameters;
d) Functional Modeling, which introduces modeling through the use of the HP-DVI model
template; and e) Memory Models and Mapping Files, which discusses memory model building
and its use in simulation.

Figure 6.17 The Isomorphic Forward Mapping Module

Figure 6.18 The Isomorphic Inverse Mapping Module

An illustration of the logic and timing information of the nine-bit third-order PRNS
processor is given in Figure 6.19. This diagram shows the hexadecimal representation of the
results along the processing pipeline, which are indicated by various signal names, such as
a0-a3, b0-b3, c0-c3, af0-af3, ts1-ts7, etc. The horizontal axis indicates the timing frame,
which represents the time needed for an input stimulus to reach a certain point of the pipeline.
In this diagram, the time base equals 1 nanosecond. The complete simulation package for each
of the elements of the processor is attached in Appendix B.2. A summary of the timing
information of the processor is given in Table 6.1, where the highlighted item shows the
propagation delay of the PRNS processor.

Figure 6.19 The Logic and Timing Simulation Result of the Third-Order PRNS Processor


Table 6.1 The Timing Information of the Simulation of the PRNS Processor

Module          Delay Time          Module            Delay Time
                min.     max.                         min.     max.
lat4b            6.95     7.05      add9modp          45.45    45.85
mux9             8.99     9.05      mul8x8            71.25    71.88
negmodp         17.52    17.91      forwmapn          88.96    89.12
summodp         23.09    23.12      forwmapn (clk)   178.96   189.12
sumovck         24.69    24.98      backmapn          93.06    93.53
scalerNR        34.07    34.13      backmapn (clk)   201.96   213.12
scalerNRSQ      34.02    34.13      mul9modp         172.06   172.91
scalerNRCUB     39.92    40.75
scalerRCUB      55.82    55.97      sm4TRNS          367.06   378.03
scalerRSQ       56.82    56.96
scalerR         63.75    63.88

*Note: Time Base = ns




6.3.2  A Five-bit  PRNS  Processor  Development  on  the  Magic  CAD  Tool 


A transistor-level implementation of the third-order modulo 17 polynomial RNS arithmetic
unit is done with the Magic CAD tool on a SUN SPARCstation. Using a 2-µm single-poly
double-metal CMOS technology, this unit consists of 13,002 transistors and has a footprint of
6.150 mm by 6.050 mm. Two transistor-level designs are carried out in the development of the
PRNS processor: a 5-bit modular multiplier and a complete 5-bit polynomial RNS arithmetic
system. In this hierarchical top-down design approach, a large number of the subcells used to
build the modular multiplier can also be used in the design of the PRNS processor. Even the
modular multiplier itself becomes a major subcell of the PRNS processor.

The  Five-Bit  Modular  Multiplier. 

1) Floorplan

Using the design database created in the HP-DCS design, the schematic of the modular
multiplier in Figure 6.15 is developed. The resulting block placement and wiring diagram of
the modular multiplier is shown in Figure 6.20. The design uses 1138 transistors (including
both n-type and p-type transistors), has a dimension of 1010λ by 1400λ, and has 29 I/O
connections.

The system blocks include: 1) lat.mag (latches); 2) palneg.mag (PLA version of the
negators); 3) mux5.mag (5-bit 2-to-1 multiplexer); 4) tritr3.mag (tri-state 3-bit
transmitter/receiver); 5) multip.mag (3 by 3 multiplier); 6) sumall.mag (modulo M adder);
and 7) padin.mag, padout.mag, padgnd.mag, padvdd.mag (I/O pads and power/ground pads)
and internal power/ground rails.

The input/output signals of the system are: 1) AA0-AA4 (5-bit inputs); 2) BB0-BB4
(5-bit inputs); 3) CC0-CC4 (5-bit outputs); 4) T0-T5 (test bus); 5) VSS (ground); 6) VDD
(power); 7) CLKIN (clock-in); 8) CLKOU (clock-out); 9) R/W (test bus read/write);


Figure 6.20 Floorplan of the Five-bit Modular Multiplier
(Total transistors: 1138. Dimension: 1010 lambda by 1400 lambda.)


10) TEST (test mode enable). The data flow of the design is from top to bottom, and the local
power/GND rails to each cell are routed horizontally.

A tree graph, shown in Figure 6.21, summarizes hierarchically the components required
to build the modular multiplier. The top-down tree shows the components used at the different
levels of the design and also indicates the number of cells used in each module. A list of the
Magic file names of the cells is also included. Any bigger cell can be built by "stacking" a
number of subcells, called primitive building cells, such as xor.mag (exclusive-OR Magic
cell), nand2.mag (two-input NAND Magic cell), etc. For example, the cell tritr3.mag (3-input
tri-state trans/receiver) is made of three tritr's (single-bit tri-state trans/receiver), each of
which is in turn made of two tribuf's (single-bit tri-state buffer).

Figure 6.21 A Hierarchical Design of the Five-Bit Modular Multiplier

2)  Logic  Building  Blocks 

The logic building blocks of the system are designed based upon the HP-DCS designs.
Most of the detailed Magic layouts are shown in Appendix B.3. A brief listing of these
elements is given below, along with a description of the basic design methodology of gates
such as the inverter (inv), NANDx (e.g. nand2), I/O pads (e.g. inpad), and the Programmable
Logic Array (PLA) device.

Figure 6.22 Basic CMOS Cells: 1) CMOS Inverter; 2) CMOS Two-Input NAND

The Complementary MOS (CMOS) circuit (structural) designs of an inverter and a
two-input NAND gate are shown in Figure 6.22. For the inverter cell, when there is a '0' on the
input, there is a '1' at the output. In general, the lower transistor (n-type) only has to pass a '0'
while the upper transistor (p-type) has to pass a '1'. At any moment (except during the short
transient period) no current flows from Vdd to GND. This power-saving feature is the primary
advantage of CMOS technology. For the NAND gate, the '0' propagates to the output only
when both A and B turn on the lower n-type transistors. A PostScript (see Appendix B.4.2)
printout of a Magic layout cell, nand2, is shown in Figure 6.23.
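The conduction rules described above can be summarized in a small switch-level sketch. It is an idealized steady-state model (no transients, no device sizing), and the function names `cmos_inverter` and `cmos_nand2` are hypothetical.

```python
def cmos_inverter(a: int) -> int:
    """Idealized CMOS inverter: exactly one transistor conducts."""
    pull_up = (a == 0)        # p-type transistor conducts on a '0' input
    pull_down = (a == 1)      # n-type transistor conducts on a '1' input
    assert pull_up != pull_down   # no static path from Vdd to GND
    return 1 if pull_up else 0


def cmos_nand2(a: int, b: int) -> int:
    """Idealized two-input CMOS NAND gate."""
    pull_down = (a == 1) and (b == 1)   # series n-type transistors
    pull_up = (a == 0) or (b == 0)      # parallel p-type transistors
    assert pull_up != pull_down         # no static path from Vdd to GND
    return 1 if pull_up else 0
```

In this model the output of `cmos_nand2` is '0' only when both inputs are '1', and the two networks never conduct at the same time.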


Figure  6.23  A Two-Input  CMOS  NAND  Gate  Magic  Layout 


Another important device used in the design is the PLA. A typical PLA uses an AND-OR
structure similar to that shown in Figure 6.24. This implementation also shows the clocks that
latch the inputs and outputs. The basis for a PLA is the sum-of-products representation of
binary expressions. The most straightforward PLA design uses pseudo-nMOS NOR gates,
which have the advantages of simplicity and small size. Their disadvantage is the static power
dissipation of the NOR gates.

Figure 6.24 An AND-OR Programmable Logic Array

The circuit diagram of an example PLA (Figure 6.25) helps to clarify the structure and
function of the AND-OR planes of the PLA. The input register bit for each input path is
formed by a pass transistor clocked on clk1, leading to both inverting and noninverting super
buffers. The outputs of the AND plane are formed by horizontal lines with pull-up transistors
(wimpy pull-ups) at their leftmost ends. The function of the PLA's AND plane is then
determined by the locations and gate connections of the pull-down transistors connecting the
horizontal lines to ground. Each output running horizontally from the AND plane carries the
NOR combination of all input signals that lead to the gates of the transistors attached to it. For
example, the horizontal row labeled R3 has three transistors attached to it in the AND plane

(that is, A + B + C). If any of these inputs is high, then R3 will be pulled down toward ground.
Thus, R3 is the NOR of (A + B + C), which is $R3 = \overline{A + B + C}$.

Figure 6.25 A Pseudo-nMOS Programmable Logic Array Design

Once again, each of the outputs of the OR plane is the NOR of the signals leading to the
gates of all transistors attached to it. For example, both R1 and R3 lead to the gates of
transistors leading from the output line z2 to ground. Thus, z2 = INV(NOR(R1, R3)) = OR(R1, R3)
= R1 + R3 = $\overline{(A + C)} + \overline{(A + B + C)} = \bar{A}\bar{C} + \bar{A}\bar{B}\bar{C}$, which appears directly in the sum of
products canonical form of Boolean functions of the PLA inputs, as the OR of AND terms.
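The NOR-NOR structure and the resulting sum-of-products form can be checked with a few lines of Python. This is a sketch under stated assumptions: the rows R1 and R3 are assumed to be driven by the uncomplemented inputs (A, C) and (A, B, C) respectively, and the names `nor` and `pla_z2` are hypothetical.

```python
from itertools import product


def nor(*signals: int) -> int:
    """A pseudo-nMOS NOR line: pulled low if any attached gate is high."""
    return int(not any(signals))


def pla_z2(a: int, b: int, c: int) -> int:
    """Output z2 of the example PLA (input polarities are assumed)."""
    r1 = nor(a, c)              # AND-plane row R1
    r3 = nor(a, b, c)           # AND-plane row R3
    return 1 - nor(r1, r3)      # OR-plane NOR followed by an output inverter


# Compare with the sum-of-products form over all input combinations.
for a, b, c in product((0, 1), repeat=3):
    sum_of_products = ((1 - a) & (1 - c)) | ((1 - a) & (1 - b) & (1 - c))
    assert pla_z2(a, b, c) == sum_of_products
```

Each AND-plane NOR row therefore acts as the AND of the complemented literals attached to it, which is why the two cascaded NOR planes realize an AND-OR function.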

Of all the CMOS circuit structures, Input/Output (I/O) structures require the greatest
amount of circuit design expertise together with detailed process knowledge. It is often
convenient to build I/O pads with a constant height and width, with connection points at
prespecified locations. Pad size is usually defined by the minimum size to which a bond wire
can be attached. Other design considerations include sufficient drive capability of the output
pad, susceptibility to latch-up, and reliability problems related to oxide breakdown due to
electrical overstress (EOS) or electrostatic discharge (ESD). Usually a combination of a
resistance and diode clamps is used to limit this potentially destructive voltage. A typical
input pad circuit is shown in Figure 6.26. Clamp diodes D1 and D2 turn on if the voltage at
node X rises above Vdd or falls below GND. Resistor R is used to limit the peak current that
flows in the diodes in the event of an unusual voltage excursion.

Figure 6.26 Typical Input Protection Circuit

The following elements are the building blocks of the five-bit modular multiplier shown
in Figure 6.20. The detailed cell placement, which gives the total number of transistors used
and the size of the cell for each element in the design hierarchy, can be found in Appendix B.3.

a) mfadder - This cell is a modified full adder which has four input bits and a carry-in bit.
The design consists of two xor.mag's, two and2.mag's, three nand3.mag's, and a nand4.mag.
The cell is a special design dedicated to the implementation of the parallel multiplier
cell, multip;

b) multip - The three-by-three multiplier, which consists of six mfadder cells, is capable of
performing fast parallel multiplication;

c) zfad - This is a regular full adder with two input bits and a one-bit carry-in, which is a
subcell for implementing the four-bit adder, fadder4;

d) fadder4 - A four-bit adder;




e) sumall - This is a modular adder which consists of a four-bit full adder cell, fadder4, a
PLA version of the modulo converter, palsneg, the glue logic gates of the sum overflow checker,
and a five-bit multiplexer, mux5. The task of the module is to perform addition with overflow
checking. In the case of overflow, the modulo converter palsneg converts the result so that it
falls within the range of M;

f) palneg - A PLA implementation of the Boolean equations of Figure 6.8, which performs the
negation task; its transistor layout is shown as an example in Figure 6.27;


Figure  6.27  The  PLA  Transistor-Level  Layout  of  the  Negator  Cell  - palneg 

g) palsneg - A PLA implementation of the Boolean equations shown in Figure 6.12, which
performs the sum modulo conversion;




h) lat - A five-bit latch. The latches lat give the system the capability of synchronized
operation with the outside world and reduce the skew of signal flow in pipeline processing;

i) mux5 - A five-bit multiplexer. The 2-to-1 five-bit multiplexer mux5 is part of the
decision-making circuits which are used on many occasions in the design. Although the bits of
mux5 are not fully used in some places, the compactness and modularity of the cell prove
handy in hierarchical design and compensate for the wasted bits; and

j) tritr3 - Three-bit tri-state transmitter/receiver buffer. The tri-state buffer tritr3 is a
bidirectional transmitter/receiver circuit with read/write controls. The module is used as an
embedded test circuit that makes the system observable and controllable.
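
The behavior of the modular adder sumall described in item e) can be sketched as follows. This is a behavioral illustration only, not the gate-level design: the modulus value M = 17 is an assumption (the text names it only M), and the comments map each step to the cell that plays that role.

```python
M = 17  # assumed five-bit modulus; the text only calls it M

def sumall(a, b, m=M):
    """Behavioral sketch of the modular adder for residues 0 <= a, b < m."""
    raw = a + b              # plain binary addition (role of the adder cell)
    overflow = raw >= m      # role of the sum overflow checker
    corrected = raw - m      # role of the PLA modulo converter
    # Role of the multiplexer: pick the corrected value only on overflow.
    return corrected if overflow else raw
```

Because both operands are already reduced, a single conditional correction is enough to bring the sum back into the range of M.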

The Negator and the sum modulo converter are implemented by the Programmable
Logic Array (PLA) subcells palneg and palsneg, respectively. Both modules have a regular
structure and require high-speed processing capability, so the pseudo-nMOS PLA is a good
candidate for this application. A CMOS design of the modular multiplier with I/O pads is
shown in Figure 6.28. The Magic layouts of all of its subcells are collected in
Appendix B.3.

3)  Testability  Plan 

A trade-off is made between the system testability and its hardware budget. The built-in
test module is implemented by using the transmitter/receiver tritr3 and a pair of
multiplexers, mux5 (see Figure 6.29).

a)  What  to  test 

The test point is located at the core of the system, multip, which is at the midstream of the
data flow. If any processing errors exist, they are detected immediately, and it can be
determined whether they originate upstream or downstream. It would be helpful to have test
points at each module along the data flow. However, this has the disadvantages of requiring
more hardware and adding gate delay to data propagation.

b)  How  to  test 




Figure  6.28  A CMOS  Magic  Layout  of  the  Five-bit  Modular  Multiplier 




Figure  6.29  The  Embedded  Test  Circuitry  of  the  Five-bit  Modular  Multiplier 


The test circuits can be operated in either normal operation monitoring mode or test
mode. In normal operation, observability is achieved by monitoring (reading data out of) the
multiplier multip. To perform the monitoring, the signal TEST is grounded to select A and B
as the inputs of the mux5. The READ signal is asserted to make the tritr3s serve as
transmitters such that the output data of the multiplier, multip, can be observed. In TEST
mode, controllability is achieved by forcing known states (e.g., inputs) into the multiplier
multip and reading its output to compare with the precomputed results. To perform the testing,
the signal TEST is set high (TEST enabled) such that the normal inputs A and B are blocked.
The inputs to the multiplier now come from the testing data T0-T5 under the control of the
R/W signal, which sets the tritr3s as receivers. The data to be tested are latched into the
tritr3s after a period of time equal to the processing delay of the multiplier. The timing
diagram for normal operation and testing is shown in Figure 6.30, which includes three types
of timing:
c)  Normal  operation  mode 

The two-phase clocks CLKIN and CLKOU are used to strobe the inputs and outputs of the
modular multiplier. The phase between the clocks should be arranged to ensure that the
results are strobed at the right clock edge. This means that the period of the difference in phase


[Timing diagram. Panel A, normal operation mode: INPUT DATA (AA, BB), CLKIN (input data
clocked in), CLKOU (result data clocked out), and OUTPUT DATA (CC), with T the data
processing time. Panel B, normal operation monitoring mode: TEST disabled, R/W set to read
the check-point data, and t0-t5 showing the check-point data. Panel C, test mode: TEST
enabled, R/W switching from write (W: write data into multip) to read (R: read data out of
multip), and t0-t5 carrying the test data and then the result.]


Figure  6.30  The  Timing  Diagram  of  the  Operation  of  the  Modular  Multiplier 


should be larger than the data processing time T of the multiplier to ensure that correct
results are obtained.

d)  Normal  operation  monitoring  mode 

In the normal operation monitoring mode, the R/W (read/write) signal stays low while CLKOU
is high in order to monitor the valid data of the multiplier. The signal TEST always stays
low in this mode.


e)  Test  mode 




To ‘observe’ the operation of the multiplier, the signal TEST stays high (TEST enabled).
The R/W signal sets the direction of the tritr3s to match the ‘input’ of the testing data and
the ‘output’ of the result.
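
The monitoring and test modes described above can be modeled behaviorally. The sketch below rests on several assumptions that the text does not confirm: the core multip is treated as a plain 3-bit by 3-bit multiplier, the test bus T0-T2 is taken to carry one operand and T3-T5 the other, and the function names are hypothetical.

```python
def multip(x, y):
    """Stand-in for the multiplier core: 3-bit by 3-bit product (assumed widths)."""
    return (x & 0b111) * (y & 0b111)

def run_cycle(test, a=0, b=0, t_bus=0):
    """Model one operation and return the six-bit value readable on T0-T5."""
    if test:
        # TEST high: normal inputs A and B are blocked and the operands come
        # from the test bus (assumed split: T0-T2 -> x, T3-T5 -> y).
        x, y = t_bus & 0b111, (t_bus >> 3) & 0b111
    else:
        # TEST low: the multiplexers pass A and B; the bus is only read.
        x, y = a, b
    # With R/W set to read, the tri-state buffers act as transmitters.
    return multip(x, y)

def self_test():
    """Force every operand pair in test mode and compare with precomputed products."""
    expected = {(x, y): x * y for x in range(8) for y in range(8)}
    return all(run_cycle(True, t_bus=x | (y << 3)) == expected[(x, y)]
               for x in range(8) for y in range(8))
```

Calling `run_cycle(False, a=3, b=5)` models monitoring of a normal operation, while `self_test()` models the controllability check against precomputed results.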

The  Five-Bit  Third-Order  Polynomial  RNS  Processor 

1)  Floorplan 

Following the same procedure as in the development of the modular multiplier, the block
placement and wiring diagram of the polynomial RNS processor is laid out as shown in Figure
6.31. The total design uses 13,002 transistors, has dimensions of 6150 λ by 6050 λ, and has 72
I/O connections.

The system blocks include: 1) lat.mag (latches); 2) fmap.mag (the isomorphic forward
mapping), which includes sumall.mag (the modular adder), lat.mag, and a number of scalers
such as scalerR.mag, scalerNR.mag, scalerRSQ.mag, scalerNRSQ.mag, scalerRCUB.mag, and
scalerNRCUB.mag; 3) bmap.mag (the isomorphic inverse mapping), which includes
scalerNEG.mag in addition to the subcells used in fmap.mag; 4) mulmodp.mag (the
modular multiplier); and 5) padin.mag, padout.mag, padgnd.mag, and padvdd.mag (I/O pads
and power/ground pads) and internal power/ground rails.

The input/output signals of the system are: 1) A0-A19 (five bits per tuple, four tuples);
2) B0-B19 (the same as inputs A); 3) C0-C19 (the same as inputs A); 4) Vss (Ground); 5) Vdd
(Power); 6) CLK1 (phase 1 clock); and 7) CLK2 (phase 2 clock). The data flow of the design is
from top to bottom, and the local power/GND rails to each cell are routed horizontally.

A tree graph shown in Figure 6.32 summarizes hierarchically the cells required to build
the polynomial RNS processor. Notice that all the basic cells used in the design are the same
cells used in the design of the five-bit modular multiplier.

2)  Logic  Building  Blocks 

a) Scaler modules include scalerNEG.mag, scalerR.mag, scalerNR.mag, scalerRSQ.mag,
scalerNRSQ.mag, scalerRCUB.mag, and scalerNRCUB.mag. These cells are the basic




THE  FLOORPLAN  OF  THE  5-BIT  POLYNOMIAL  RNS  PROCESSOR 


TOTAL TRANSISTORS : 13,002
DIMENSION : 6150 um x 6050 um
TECHNOLOGY : 2-um Double-Level Metal CMOS


Vdd,  GND:  5-Volt  DC  power,  Ground 
Ax:  Inputs 
Bx:  Outputs 

CK1,  CK2:  Two-phase  system  clocks 

fmap:  Isomorphic  forward  mapping 

bmap:  Isomorphic  backward  mapping 

scaler: Mapping scaler, submodule of fmap and bmap

sum:  Modular  adder,  submodule  of  fmap  and  bmap 

mulmodpx:  Modular  multipliers 


Figure  6.31  Floorplan  of  the  Five-Bit  Polynomial  RNS  Processor 



Figure  6.32  A Hierarchical  Design  of  the  Five-Bit  Polynomial  RNS  Processor 




building blocks of the isomorphic forward/inverse mappings. These layouts are the physical
implementation of the block diagram shown in Figure 6.3. These cells are built with a
palneg.mag (Negator) and a sumall (modular adder), which form a common structure of all the
scaler modules, as shown in Figure 6.33. Figure 6.34 shows an example of a physical layout
of the scaler cell, scalerRSQ.mag;


Figure 6.33 The Split-then-Add Scaler Module

b.2) fmap.mag - This cell is designed according to the FFT-type transform and the
multiplier-free scaling scheme discussed in previous sections and performs the isomorphic
forward mapping;

b.3) bmap.mag - This cell performs the isomorphic inverse mapping and has a structure
similar to that of fmap.mag;

b.4) mulmodp - The cell, which was developed in the preceding section, performs the
processing task (modulo multiplication); and




Figure  6.34  A Transistor  Layout  of  the  Scaler  Module  scalerRSQ 


b.5) lat - A five-bit latch. The latches lat give the system the capability of synchronized
operation with the outside world while reducing the skew of signal flow in pipeline
processing.


The  Magic  Design  Rule  Summary. 

1) Signal routing fashions: a) power/ground rails (fingers) route through each cell across
the top and bottom horizontally via metal1; and b) signals route through cells vertically via
polysilicon layers. When signals need horizontal routes, the polysilicon layers are contacted
with metal1, which can be placed horizontally. Once the destination is reached, metal1
contacts the polysilicon, which continues the routing (see the cell - mfadder);

2)  Since  the  power  and  ground  sources  are  placed  horizontally  on  the  top  and  the  bottom  of 
each  cell,  every  other  cell  is  flipped  vertically  to  reduce  the  number  of  horizontal  fingers 
when  the  cells  are  stacked  together  (see  the  cell  - mfadder); 

3)  Symmetrical  cell  layout  makes  it  more  efficient  to  pack  cells  together  by  simply  flipping 
and  rotating  the  cells  (see  the  cells  - zfad  or  mfadder); 

4) To improve the current drain, the thick power/ground rails are located at the chip
boundary and the fingered rails run across to each cell (see the cell - mulmodp);

5) Labeling each signal (including power/ground rails) allows for much more efficient
routing and tracing (see the cell - mulmodp);

6) A rule of thumb is to avoid applying too much metal2 routing in the lower level (local)
design. A clean low-level design lets metal2 route anywhere without difficulty when routing
at the top of the hierarchy (see the cell - mulmodp);

7) In the PLA design, ground rails are placed every three or four gates in both the AND and
OR planes. This compensates for the slow propagation inherent in n+ diffusion strips (see the
layouts of the PLA cells - palneg or palsneg);

8) To avoid latch-up problems in the CMOS design, each cell is embedded with improved
substrate contacts. ESD-protected guard rings are also included in the I/O pads;

9) To take full advantage of the various CMOS design methods in the system development,
a number of techniques are utilized, including static CMOS gates (combinational
and sequential), pseudo-nMOS gates, pass transistors, tri-state logic, and programmable
logic arrays (PLAs).


CHAPTER  7 

CONCLUSION  AND  FUTURE  RESEARCH 

This dissertation examines some properties of finite computational structures such
as finite groups, rings, and fields, and extends their applications to develop fast algorithms
and to build high-speed, low-complexity very large scale integrated circuit (VLSI) systems for
digital signal processing. A finite polynomial ring structure is used to expedite and
simplify the multiplication of polynomials. The applications of this polynomial structure are
in the areas of short-block-length cyclic convolution and complex number arithmetic, which
originates from the algebraic integer approximation concept. A new algorithm for finite field
transforms with the cyclic convolution property (CCP) is investigated. This algorithm combines
abstract algebraic concepts, which include normal basis representation and the conjugacy
property, with factor-type fast Fourier transform (FFT) algorithms to expedite finite field
transforms.

Applications in digital filtering, correlation studies, radar matched filtering, and the
multiplication of very large integers are based on digital convolution, which can be
implemented most efficiently by the number theoretic transform (NTT) method with some
constraints. The arithmetic required to accomplish the NTT is exact and involves additions,
subtractions, and bit shifts. As in the case of the DFT, fast algorithms also exist for the
NTT. These transforms are defined on finite fields and rings of integers with all arithmetic
performed modulo an integer. The family of NTTs includes Fermat, Mersenne, Rader,
pseudo-Fermat, pseudo-Mersenne, complex Mersenne, and complex Fermat transforms. They are
truly digital transforms and their implementation involves no round-off error. NTT
implementation requires additions, subtractions, and bit shifting, but usually does not
require multiplications. Others possess fast algorithmic structures similar to those of the
FFT.
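
As a small illustration of exact convolution by an NTT, the sketch below uses a Fermat number transform of length 8 modulo the Fermat prime 17, with 2 as the root of unity. The parameter choices and function names are assumptions made for the example. For clarity the transform is written as a direct sum using `pow` rather than as shift-and-add hardware, although every power of 2 used here corresponds to a bit shift.

```python
P = 17      # Fermat prime 2**4 + 1 (assumed for illustration)
ROOT = 2    # 2 has multiplicative order 8 modulo 17

def ntt(x, root=ROOT, p=P):
    """Direct-form number theoretic transform of a length-8 sequence."""
    n = len(x)
    return [sum(v * pow(root, i * k, p) for i, v in enumerate(x)) % p
            for k in range(n)]

def intt(X, root=ROOT, p=P):
    """Inverse transform: use the inverse root and scale by 1/n modulo p."""
    n = len(X)
    inv_root = pow(root, -1, p)   # modular inverse (Python 3.8+)
    inv_n = pow(n, -1, p)
    return [(inv_n * s) % p for s in ntt(X, inv_root, p)]

def cyclic_convolution(a, b):
    """Cyclic convolution via point-wise products in the transform domain."""
    A, B = ntt(a), ntt(b)
    return intt([(u * v) % P for u, v in zip(A, B)])
```

For example, convolving `[1, 2, 0, 0, 0, 0, 0, 0]` with `[3, 1, 0, 0, 0, 0, 0, 0]` gives `[3, 7, 2, 0, 0, 0, 0, 0]` with no round-off error. The result is only meaningful while every output stays below 17, which illustrates the constraint linking dynamic range to the modulus.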

The trend of VLSI system design utilizes high levels of integration based on efficient
algorithms. To construct these algorithms, one must be familiar with the powerful structures
of number theory and modern algebra. Since the structures containing the set of integers,
polynomial rings, and finite fields play an important role in the design of signal processing
algorithms, Chapter 2 investigates the fundamental properties of these finite computational
structures. The discussion begins by exploring the concepts of congruence and residue
reduction. Discussion of the Chinese remainder theorem in relation to both integers and
polynomials leads to the realization that the RNS is capable of carry-free high-speed DSP
processing and to a better understanding of the relationship between polynomial algebra and
digital convolution.
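
The carry-free property of the RNS can be seen in a short sketch. The moduli (5, 7, 9) and all function names are assumptions chosen for illustration. Each residue channel works independently, and the Chinese remainder theorem recovers the integer result.

```python
from math import prod

MODULI = (5, 7, 9)   # pairwise coprime moduli (assumed); dynamic range M = 315

def to_rns(x, moduli=MODULI):
    """Residue reduction of an integer into independent channels."""
    return tuple(x % m for m in moduli)

def rns_add(a, b, moduli=MODULI):
    # No carries propagate between channels.
    return tuple((u + v) % m for u, v, m in zip(a, b, moduli))

def rns_mul(a, b, moduli=MODULI):
    return tuple((u * v) % m for u, v, m in zip(a, b, moduli))

def from_rns(r, moduli=MODULI):
    """Chinese remainder theorem reconstruction."""
    big_m = prod(moduli)
    total = 0
    for ri, mi in zip(r, moduli):
        partial = big_m // mi
        total += ri * partial * pow(partial, -1, mi)  # modular inverse (Python 3.8+)
    return total % big_m
```

As long as the true result is smaller than 315, `from_rns(rns_mul(to_rns(12), to_rns(20)))` returns 240, the same value as ordinary integer multiplication.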

The parallel between the ring of integers and the ring of polynomials over a field is
apparent. Both are special cases of an algebraic structure. The field of integers modulo a
prime number is the most familiar example of a finite field, but many of its properties extend
to arbitrary finite fields, such as $GF(p^m)$. Since the finite field $GF(p^m)$ can be regarded
as a vector space of dimension $m$ over $GF(p)$, the basis representations of a field element
in an extension field $GF(p^m)$ over its ground field $GF(p)$ are worth investigating. Because
each has distinct features that make it suitable for specific applications, this research
effort investigates three basis types: primal basis, normal basis, and dual basis. The normal
basis system is effective in performing operations such as finding inverse elements and
power forming, which leads to useful applications in finite field transforms.

In the process of constructing $GF(p^m)$ from the ground field $GF(p)$, both the power
form and the polynomial form for the nonzero elements of $GF(p^m)$ were developed. In
Chapter 3, the structures and applications of these representations are investigated according
to their usefulness in various fields. In the case of small finite fields, it is more efficient
to use the power form to represent field elements while applying the table lookup method for
its arithmetic operations. Fast memory devices are good candidates for the implementation of
tables as long as the table capacity does not exceed the current device technology. However,
accomplishing addition operations in power form is an awkward task. Thus, Zech's logarithm
with its special properties is presented to solve this nontrivial problem. This leads to
the application of the index calculus method for finite field arithmetic. As the size of the
field becomes larger, the conversion between power and polynomial forms creates a challenging
task which is traditionally termed the discrete logarithm and exponentiation problem. To
avoid such a situation, polynomial form arithmetic becomes a commonly used method despite
its difficulties in the multiplication operation. Three different multiplication algorithms,
presented in a step-by-step manner, are investigated based upon their respective basis
representations (primal, dual, and normal polynomial form) of finite fields.
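
The role of Zech's logarithm in power-form addition can be sketched with a small table. The field GF(17) with primitive element 3 is an assumption chosen to keep the tables short, and the names are hypothetical. Multiplication in power form is simply index addition modulo 16, while addition uses the identity $g^i + g^j = g^{i + Z(j - i)}$, where $1 + g^n = g^{Z(n)}$.

```python
P, G = 17, 3    # assumed small field GF(17) with primitive element 3
Q = P - 1       # order of the multiplicative group

EXP = [pow(G, i, P) for i in range(Q)]        # index -> field element
LOG = {v: i for i, v in enumerate(EXP)}       # field element -> index

# Zech's logarithm table: 1 + g**n = g**ZECH[n].
# The entry is None when 1 + g**n is zero, which has no index.
ZECH = [LOG.get((1 + EXP[n]) % P) for n in range(Q)]

def mul_power_form(i, j):
    """Index of g**i * g**j: a single modular addition of indices."""
    return (i + j) % Q

def add_power_form(i, j):
    """Index k with g**k = g**i + g**j, or None if the sum is zero."""
    z = ZECH[(j - i) % Q]
    return None if z is None else (i + z) % Q
```

One table lookup and one index addition therefore replace the conversion to polynomial form and back.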

Finite digital convolution is a numerical procedure which has many powerful
applications, such as the implementation of finite impulse response (FIR) and infinite impulse
response (IIR) digital filters, the operation of auto- and cross-correlation, as well as the
computation of the products of polynomials and very large integers. Chapter 4 details the
general finite structure transforms necessary to carry out these applications. Recently,
various researchers have proposed the use of transforms over finite structures using number
theoretic concepts for error-free, fast, efficient computation of finite digital convolutions
of real integer sequences or complex integer sequences. These number theoretic transforms
have a major disadvantage: they require a rigid relationship between the dynamic range and
the obtainable transform length. Fortunately, the transforms in finite structures are also
applicable over extension fields, which relaxes the imposed limitations.

In many applications, such as radar or communication, pre-convolved data sequences
consist of complex quantities. The transform allows greatly increased sample lengths over
those defined in the ground field, $GF(p)$, like those of the FNT or MNT; for a sufficiently
large $p$, one can use this transform to convert a sequence of complex integers $a_n$ into the
sequence $A_k$ in $GF(p^2)$ for which the inverse transform of $A_k$ is precisely the original
sequence of complex numbers $a_n$. Consequently, filtering operations or convolutions without
round-off error are obtained using this transform on a sequence of complex integers. However,
the application of the cyclic convolution property (CCP) of the DFT serves to reduce the
computational complexity of finite convolution only when some fast Fourier transform (FFT)
algorithms are available and applied to the DFT.

For the higher order extension field, $GF(p^m)$, transforms exist as long as the necessary
and sufficient conditions are met. By applying finite field properties, such as the conjugacy
property and the basis representation of the field elements, to the existing fast transform
algorithms, a significant reduction in computational complexity is achieved. There also has
been considerable work on high-speed residue number arithmetic for use in high data rate
digital signal processing. The structure of a complex integer ring is generalized to form a
complex residue number system (CRNS), which is a powerful computational number system. In
the CRNS, complex multiplication is accomplished by means of a real index calculus because
the selection of the system parameters is not overly constrained by the algebraic structure
required by an NTT. The CRNS is only useful when the complex variables are feasibly
represented by the power form over a second-order extension field and computed using the index
calculus method. When complex variables are represented by an isomorphic mapping,
traditionally referred to as the quadratic residue number system (QRNS), the usual operations
are reduced to point-wise operations. A further extension to higher-order polynomials
suggested by the algebraic integer concept is the polynomial RNS (PRNS). The principal
advantage of the extended RNS processing is its ability to reduce a complex multiplication or
addition to the calculation of a single integer multiplication or integer addition modulo a
prime in parallel sets of residue channels. When the primes are small, the implementation
complexity of the computation in each of the residue channels is correspondingly small,
implying substantially high throughput.
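
The point-wise nature of QRNS multiplication can be illustrated with a short sketch. The prime 13 (of the form $4k + 1$) and the element 5, which satisfies $5^2 \equiv -1 \pmod{13}$, are assumptions chosen for the example, as are the function names.

```python
P = 13   # prime of the form 4k + 1 (assumed for illustration)
J = 5    # 5 * 5 = 25 = -1 (mod 13), so J plays the role of sqrt(-1)

def to_qrns(a, b):
    """Isomorphic forward mapping of a + ib to a pair of residues."""
    return ((a + J * b) % P, (a - J * b) % P)

def from_qrns(z, zc):
    """Inverse mapping back to (real, imaginary) residues modulo P."""
    inv2 = pow(2, -1, P)         # modular inverses (Python 3.8+)
    inv2j = pow(2 * J, -1, P)
    return ((z + zc) * inv2 % P, (z - zc) * inv2j % P)

def qrns_mul(u, v):
    # Complex multiplication becomes two independent residue multiplications.
    return (u[0] * v[0] % P, u[1] * v[1] % P)
```

For example, $(1 + 2i)(3 + i) = 1 + 7i$, and `from_qrns(*qrns_mul(to_qrns(1, 2), to_qrns(3, 1)))` returns `(1, 7)`. Two channel multiplications replace the four real multiplications of the direct form.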

All the finite field transforms and the RNS systems which are discussed operate over a
ground field $GF(p)$ where $p$ may be a Fermat or Mersenne number, or $p$ satisfies some
specified condition such as having the form $4k + 1$ or $2Nk + 1$. When these requirements are
not met, transforms may still exist in the higher order extension field $GF(p^m)$. A number of
field properties, like index calculus, conjugacy in finite field elements, and basis
representation, are used to perform and expedite these transforms in extension fields.

Chapter 5 elaborates the investigation of a fast finite field transform over $GF(p^m)$
based on the Good-Thomas prime-factor FFT algorithm. The intermediate results are
represented by using a normal basis representation. A significant computational reduction is
obtained by applying a conjugacy relation to a cyclotomic coset of the intermediate variables,
and by using the cyclic-shift property of $p$-th powers of the variables within the normal
basis representation. Once the VLSI architecture for the butterfly module of a cyclotomic
coset based on the algorithm is developed, arrays of these modules are used to form the stages
of the fast transform. For the case of $p = 2$, $m = 8$, performance analysis shows a six-fold
reduction in computational complexity, which is achieved at the cost of an added hardware
budget.
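
The index mapping behind the Good-Thomas prime-factor algorithm can be sketched over a small prime field. The sketch is an illustration only: it works in GF(7) with the element 3 of order 6 rather than in an extension field, it uses no normal basis, and all names are hypothetical. A length-6 transform is computed as a 2-by-3 two-dimensional transform with no twiddle factors between the stages.

```python
def dft_mod(x, w, p):
    """Direct DFT of x over GF(p) using a root of unity w of order len(x)."""
    n = len(x)
    return [sum(x[i] * pow(w, i * k, p) for i in range(n)) % p for k in range(n)]

def good_thomas(x, n1, n2, w, p):
    """Prime-factor DFT of length n1*n2 with gcd(n1, n2) = 1."""
    n = n1 * n2
    # Input index map: n = (n2*i1 + n1*i2) mod N.
    grid = [[x[(n2 * i1 + n1 * i2) % n] for i2 in range(n2)] for i1 in range(n1)]
    w1 = pow(w, n2, p)   # root of unity of order n1
    w2 = pow(w, n1, p)   # root of unity of order n2
    rows = [dft_mod(row, w2, p) for row in grid]          # length-n2 transforms
    out = [0] * n
    for k2 in range(n2):
        col = dft_mod([rows[i1][k2] for i1 in range(n1)], w1, p)  # length-n1
        for k1 in range(n1):
            # Output index map (Chinese remainder theorem):
            # k = k1 (mod n1) and k = k2 (mod n2).
            k = next(c for c in range(n) if c % n1 == k1 and c % n2 == k2)
            out[k] = col[k1]
    return out
```

With `x = [1, 2, 3, 4, 5, 6]`, the call `good_thomas(x, 2, 3, 3, 7)` agrees with `dft_mod(x, 3, 7)`, which shows that the two short transforms reproduce the full-length one.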

A comparison of the total number of operations required by each method with that of the
direct DFT, which requires approximately 65,000 operations, clearly suggests a dramatic
reduction in computational operations when the normal basis system is implemented. This
system takes advantage of multiplication-free shifting by applying the normal basis
representation to the variables in the pipeline stages. In terms of physical realities, the
simplified computational structure makes this system readily adaptable to a compact device
without relinquishing any of its efficiency. Interfacing with pre-existing systems which have
representations in a different format, such as the power form or the standard polynomial
form, simply involves making a basis change at the end of the processing pipeline.

As fast Fourier transform (FFT) algorithms are applied over the finite field $GF(p^m)$, the
intermediate results are encoded in a normal basis representation. The evaluation of the
intermediate results, which are shown to be field elements of a cyclotomic coset within the
finite field, introduces a nontrivial multiply-accumulate task. To mitigate the problem, a
fast and compact computational structure based on the basis-change algorithm is used to
perform these operations over the finite field. This research effort suggests a new
computational algorithm which involves the change of basis of field elements during the
nontrivial evaluation of the coset elements. This finite field arithmetic is equipped with
cyclic shifting for scaling, simple table lookups for polynomial reduction, and a series of
component-wise modular adders. Application of these concepts to the existing fast transform
not only serves to simplify the transform, but also leads to the development of a very compact
signal processing device which can be implemented as a building block and seamlessly
integrated into a discrete Fourier transform (DFT) system.
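
The cyclic-shift property that makes the normal basis attractive can be checked on a small field. The sketch below assumes $GF(2^4)$ defined by $x^4 + x^3 + x^2 + x + 1$, whose root $\beta$ generates the normal basis $\{\beta, \beta^2, \beta^4, \beta^8\}$. The field, the table-based basis change, and the function names are choices made for illustration, not the dissertation's design.

```python
from itertools import product

MOD = 0b11111   # x^4 + x^3 + x^2 + x + 1, irreducible over GF(2)
M = 4

def gf_mul(a, b):
    """Multiply two elements of GF(2^4) held in polynomial form."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << M):
            a ^= MOD          # polynomial reduction
    return r

# Normal basis {beta, beta^2, beta^4, beta^8} with beta = x.
BASIS = [0b0010]
for _ in range(M - 1):
    BASIS.append(gf_mul(BASIS[-1], BASIS[-1]))

def from_normal(coords):
    """Basis change: normal-basis coordinates -> polynomial form."""
    v = 0
    for c, b in zip(coords, BASIS):
        if c:
            v ^= b
    return v

# Basis change in the other direction, as a lookup table over all 16 elements.
TO_NORMAL = {from_normal(c): c for c in product((0, 1), repeat=M)}

def square_normal(coords):
    """Squaring in normal-basis coordinates is a one-place cyclic shift."""
    return coords[-1:] + coords[:-1]
```

Squaring any element by `gf_mul(v, v)` and converting with `TO_NORMAL` gives the same coordinates as `square_normal(TO_NORMAL[v])`, so raising to the $p$-th power costs no multiplications.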

The application of the PRNS to complex multiplication offers tremendously low
complexity within the digital signal processing (DSP) area. Unfortunately, isomorphic
mappings between the complex number and PRNS domains suffer from a nontrivial transform
problem which eventually precludes the inherent advantages of the PRNS approach. Chapter 6 is
a close examination of the issues involved in the development of the fast and compact
third-order PRNS machine, especially the VLSI layout using CMOS technology. A significant
simplification in the mapping procedure is achieved by an FFT-like scheme and a sequence of
primitive split-then-add operations. These operations originate from an algebraic
congruence and a residue reduction of a Fermat prime within finite fields.
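
A split-then-add scaling step can be sketched as follows, assuming the Fermat prime 17 as the modulus (the modulus value and the function names are assumptions for illustration). Because $2^4 \equiv -1 \pmod{17}$, a value shifted left by $s$ bits can be split into its low four bits and its high bits, and the high part is then negated and added, so no multiplier is needed.

```python
M = 17   # Fermat prime 2**4 + 1 (assumed modulus)

def negate(x):
    """Additive inverse modulo M (the role of a negator cell)."""
    return (M - x) % M

def mod_add(a, b):
    """Modular addition with a single conditional correction."""
    s = a + b
    return s - M if s >= M else s

def scale_pow2(x, s):
    """Return (x * 2**s) mod 17 for 0 <= x <= 16 and 0 <= s <= 3."""
    shifted = x << s
    low = shifted & 0xF     # bits 0..3 keep their weight
    high = shifted >> 4     # higher bits carry weight 2**4 = -1 (mod 17)
    return mod_add(low, negate(high))
```

Scaling by 2, 4, or 8 therefore reduces to wiring (the shift), one negation, and one modular addition.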

The complete development of the polynomial RNS system with its circuit and silicon
layout not only suggests a significant reduction in both the computation core and the number
system conversions but also confirms the advantages of the novel mapping algorithm. In terms
of hardware budget, an $N$th-order PRNS system equipped with the novel approach has a
complexity on the order of $O(N)$. An algebraic integer processor of the same degree would be
on the order of $O(N^2)$. Consequently, as $N$ becomes larger to meet the requirement of high
accuracy computation, a substantial saving is realized.
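
The difference between the two complexity orders can be illustrated with a third-order example. The sketch assumes arithmetic modulo 17 and multiplication of polynomials modulo $x^4 + 1$, whose roots modulo 17 are 2, 8, 15, and 9. These parameters and the function names are assumptions for illustration. The PRNS path uses four point-wise products, while the direct path uses sixteen coefficient products.

```python
P = 17
ROOTS = (2, 8, 15, 9)   # roots of x^4 + 1 modulo 17 (odd powers of 2)

def forward_map(a):
    """Evaluate the degree-3 polynomial a at each root (forward mapping)."""
    return [sum(c * pow(r, n, P) for n, c in enumerate(a)) % P for r in ROOTS]

def inverse_map(A):
    """Interpolate back to coefficients (inverse mapping)."""
    inv4 = pow(4, -1, P)   # modular inverse (Python 3.8+)
    return [inv4 * sum(v * pow(r, -n, P) for v, r in zip(A, ROOTS)) % P
            for n in range(4)]

def prns_mul(a, b):
    """Polynomial product via N = 4 independent residue multiplications."""
    products = [u * v % P for u, v in zip(forward_map(a), forward_map(b))]
    return inverse_map(products)

def direct_mul(a, b):
    """Reference product modulo x^4 + 1 using N*N = 16 multiplications."""
    out = [0] * 4
    for i, u in enumerate(a):
        for j, v in enumerate(b):
            k = i + j
            if k < 4:
                out[k] = (out[k] + u * v) % P
            else:
                out[k - 4] = (out[k - 4] - u * v) % P   # x^4 = -1
    return out
```

Both functions return the same coefficients for any pair of inputs, for instance `[1, 2, 3, 4]` and `[5, 6, 7, 8]`, so the saving comes entirely from the point-wise structure of the residue channels.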




7.2  Future  Research 


Cozzens and Finkelstein [Coz85] introduced the idea of performing a computation in a
certain ring of algebraic integers, which is accomplished by approximating the appropriate
complex roots of unity by elements of the ring. Games [Gam85] also illustrated a method to
perform this approximation by elements of the algebraic integers of $Q(\omega)$. Due to the
nature of these approximations, the dynamic range of the computation is dramatically reduced,
which results in the advantageous application of the polynomial RNS concept. When the
primes of the PRNS are small, the implementation complexity of the computation in each of
the residue channels is correspondingly small, which implies a substantially high throughput.
Therefore, a worthy research task is to develop an efficient isomorphic mapping algorithm
for higher order PRNS systems to enable accurate complex arithmetic under the high-degree
algebraic integer approximation.

Massey and Omura [Wan85] suggested a multiplication scheme which computes the
product of two elements in the finite field $GF(p^m)$. While this system possesses the features
of regularity and high throughput, it is cumbersome (see Chapter 3). A finite field transform
was also defined over $GF(p^m)$ by Blahut [Bla83], which led to this research work on the
novel algorithm that applies the mathematical concepts of normal basis, conjugacy, and basis
change in finite fields. For real-time DSP applications, ultra-high-speed hardware accelerators
are required to implement the transform algorithms. In such a case, a bulky but
high-throughput normal basis multiplier over $GF(p^m)$ is a good candidate, but needs further
development. Thus, another interesting area of research deals with solving the multiplicative
$F_{NB}$ function under the consideration of the defining polynomial of the finite field $GF(p^m)$.
Furthermore, since an array of $F_{NB}$ function modules is required to implement an m-digit
parallel normal basis multiplier, the future integrated circuit (IC) technology of wafer-scale
integration may be the solution to this problem.




Finally, an interesting historical problem which relates to the finite field transform with
DFT structure and CCP property is to extend the dynamic range available for digital
convolution. Jenkins [Jen80] proposed a direct sum of L second-degree extension fields for
wide-range complex arithmetic, which is similar to the application of the CRT to complex
integer convolution by Reed and Truong [Ree75b]. They show the existence of the isomorphism

$$GF(p^2) \cong GF(p_1^2) \oplus GF(p_2^2) \oplus \cdots \oplus GF(p_L^2) \qquad (7.1)$$

where $p = p_1 \cdot p_2 \cdots p_L$ and the $p_i$ are primes such that $-1$ is a quadratic nonresidue modulo each $p_i$. For the higher
degree extension $GF(p^m)$, where $p = p_1 \cdot p_2 \cdots p_L$ and the $p_i$ are primes, it is required to find the
following similar relationship:

$$GF(p^m) \cong GF(p_1^m) \oplus GF(p_2^m) \oplus \cdots \oplus GF(p_L^m). \qquad (7.2)$$

Suppose $\alpha_1, \alpha_2, \ldots, \alpha_L$ are primitive $d$th roots of unity in $GF(p_1^m), GF(p_2^m), \ldots, GF(p_L^m)$,
respectively. Hence, $(\alpha_1^d, \alpha_2^d, \ldots, \alpha_L^d) = (1, 1, \ldots, 1)$ is the unity element in the
external direct product of the finite field $GF(p^m)$ as shown in Equation (7.2), where $d \mid (p_i^m - 1)$ for
all $i = 1, 2, \ldots, L$, and $(\alpha_1, \alpha_2, \ldots, \alpha_L)$ corresponds to an element $\alpha \in GF(p^m)$ such that $\alpha$ is a
$d$th root of unity in $GF(p^m)$. Hence, applying the CRT to the finite field transform pairs as
shown in Equations (4.44) and (4.45), the relationship holds

$$(A_{1k}, A_{2k}, \ldots, A_{Lk}) = \sum_{n=0}^{d-1} (a_{1n}, a_{2n}, \ldots, a_{Ln}) (\alpha_1, \alpha_2, \ldots, \alpha_L)^{nk} = \left( \sum_{n=0}^{d-1} a_{1n} \alpha_1^{nk}, \sum_{n=0}^{d-1} a_{2n} \alpha_2^{nk}, \ldots, \sum_{n=0}^{d-1} a_{Ln} \alpha_L^{nk} \right). \qquad (7.3)$$
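The channel-wise structure of Equation (7.3) can be checked numerically in the simplest setting. The sketch below is an illustration under stated assumptions, not the proposed higher-degree construction: it takes m = 1 (prime fields only), L = 2 channels with the illustrative primes 5 and 13, and transform length d = 4. The helper names `dft_mod` and `crt_pair` are hypothetical.

```python
def dft_mod(a, alpha, p):
    """Length-d transform A_k = sum_n a_n * alpha**(n*k) (mod p)."""
    d = len(a)
    return [sum(a[n] * pow(alpha, n * k, p) for n in range(d)) % p
            for k in range(d)]

def crt_pair(r1, p1, r2, p2):
    """Combine residues r1 (mod p1) and r2 (mod p2) into one residue mod p1*p2."""
    inv = pow(p1, -1, p2)                      # inverse of p1 modulo p2 (Python 3.8+)
    return (r1 + p1 * ((r2 - r1) * inv % p2)) % (p1 * p2)

# Illustrative parameters (assumptions): m = 1, L = 2, d = 4.
p1, p2, d = 5, 13, 4
alpha1, alpha2 = 2, 5                          # primitive 4th roots of unity mod 5 and mod 13
alpha = crt_pair(alpha1, p1, alpha2, p2)       # corresponding root of unity modulo 65

a = [3, 10, 41, 7]                             # input sequence modulo 65
direct = dft_mod(a, alpha, p1 * p2)            # transform in the large ring
ch1 = dft_mod([x % p1 for x in a], alpha1, p1) # channel 1
ch2 = dft_mod([x % p2 for x in a], alpha2, p2) # channel 2
recombined = [crt_pair(u, p1, v, p2) for u, v in zip(ch1, ch2)]
```

Here `recombined` agrees with `direct`, which is the statement of Equation (7.3) for this small case: each channel works with a small modulus, and the CRT restores the wide-range result.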




Therefore, this technique, which exhibits the highest degree of parallelism, effectively
alleviates the dynamic range problem, and the regular architecture of each channel is readily
suited to VLSI implementation.


APPENDIX  A 

ALGEBRAIC  FOUNDATIONS:  DEFINITIONS  AND  THEOREMS 
A.1 GROUPS

D1.1. A group is a set $G$ together with a binary operation $*$ on $G$ that assigns to every two
elements $g, h \in G$ an element $g * h$ of $G$, such that the following axioms hold:

a. The associative law

b. Existence of an identity element

c. Existence of the inverse

If the group also satisfies $g * h = h * g$, then the group is called abelian (or commutative).

D1.2. A multiplicative group $G$ is said to be cyclic if there is an element $a \in G$ such that for
any $b \in G$ there is some integer $i$ with $b = a^i$. Such an element $a$ is called a generator of the
cyclic group, and we write $G = \langle a \rangle$.

D1.3. For arbitrary integers $a$, $b$ and a positive integer $n$, we say that $a$ is congruent to $b$
modulo $n$, and write $a \equiv b \bmod n$, if the difference $a - b$ is a multiple of $n$, that is, if $a = b + kn$ for
some integer $k$. Consider the equivalence classes into which the relation of congruence
modulo $n$ partitions the set $Z$. These will be the sets

$0 + nZ = \{ \ldots, -2n, -n, 0, n, 2n, \ldots \}$

$1 + nZ = \{ \ldots, 1-2n, 1-n, 1, 1+n, 1+2n, \ldots \}$

$\vdots$

$(n-1) + nZ = \{ \ldots, -n-1, -1, n-1, 2n-1, 3n-1, \ldots \}$




D1.4. The group formed by the set $\{ 0 + nZ, 1 + nZ, \ldots, (n-1) + nZ \}$ of equivalence classes
modulo $n$ with the operation $(a + nZ) + (b + nZ) = (a + b) + nZ$ is called the group of
integers modulo $n$, and denoted by $Z_n$.
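A small numerical check may make D1.4 concrete. The sketch below represents each class $a + nZ$ by its representative in 0..n-1 and verifies the group axioms of D1.1 for n = 6. The function name `coset_sum` and the choice n = 6 are illustrative assumptions.

```python
def coset_sum(a, b, n):
    """Representative in 0..n-1 of the class (a + nZ) + (b + nZ)."""
    return (a + b) % n

n = 6
elements = range(n)

# Identity: the class 0 + nZ leaves every class unchanged.
identity_ok = all(coset_sum(a, 0, n) == a for a in elements)
# Inverse: the class (n - a) + nZ undoes a + nZ.
inverse_ok = all(coset_sum(a, (n - a) % n, n) == 0 for a in elements)
# Associativity, checked exhaustively.
assoc_ok = all(
    coset_sum(coset_sum(a, b, n), c, n) == coset_sum(a, coset_sum(b, c, n), n)
    for a in elements for b in elements for c in elements
)
# Well-definedness: the result does not depend on the representatives chosen.
well_defined = coset_sum(2 + 5 * n, 3 - 4 * n, n) == coset_sum(2, 3, n)
```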

D1.5. A group $G$ is called finite if it contains finitely many elements. The number of
elements in a finite group is called its order and denoted $|G|$.

D1.6. A mapping $f : G \to H$ of the group $G$ into $H$ is called a homomorphism of $G$ into $H$ if $f$
preserves the operation of $G$. That is, if $*$ and $\cdot$ are the operations of $G$ and $H$ respectively, then
for all $a, b \in G$ we have $f(a * b) = f(a) \cdot f(b)$. In addition, if $f$ is onto $H$, then $f$ is called a
homomorphism onto. If $f$ is a one-to-one homomorphism of $G$ onto $H$, then $f$ is called an
isomorphism and we say that $G$ and $H$ are isomorphic. An isomorphism of $G$ onto $G$ is called an
automorphism.

D1.7. The kernel of the homomorphism $f : G \to H$ of the group $G$ into the group $H$ is the set
$\ker f = \{ a \in G : f(a) = e' \}$, where $e'$ is the identity element in $H$.
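As a brief illustration of D1.6 and D1.7, the sketch below checks that reduction modulo 4 is a homomorphism from $Z_{12}$ onto $Z_4$ and lists its kernel. The groups and the map `f` are chosen here only as an example; they do not appear in the text.

```python
G = range(12)                  # Z_12 under addition modulo 12

def f(a):
    """Candidate homomorphism from Z_12 into Z_4."""
    return a % 4

# f preserves the operation: f(a * b) = f(a) . f(b) for all a, b in G.
is_hom = all(f((a + b) % 12) == (f(a) + f(b)) % 4 for a in G for b in G)
# f is onto Z_4.
is_onto = {f(a) for a in G} == set(range(4))
# Kernel: the elements mapped to the identity 0 of Z_4.
kernel = [a for a in G if f(a) == 0]
```

The kernel {0, 4, 8} has three elements, so the map is onto but not one-to-one, and therefore not an isomorphism.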

D1.8. A subgroup $H$ of a group $G$ is called a normal subgroup if $gH = Hg$ for all $g \in G$.

D1.9. The left cosets of $G$ modulo $H$ are denoted by $aH = \{ ah : h \in H \}$, where $a$ is a fixed
element of $G$.

D1.10. For a normal subgroup $H$ of $G$, the group formed by the left cosets of $G$ modulo $H$ is
called the quotient group or factor group of $G$ modulo $H$ and denoted by $G/H$.

A.2  RINGS  AND  FIELDS 

D2.1. A ring $(R, +, \cdot)$ is a set $R$, together with two binary operations, denoted by $+$ and $\cdot$,
such that

a. $(R, +)$ is an abelian group

b. The multiplication $\cdot$ is associative

c. The distributive laws hold, that is, for all $a, b, c \in R$

$a \cdot (b + c) = a \cdot b + a \cdot c$,

$(b + c) \cdot a = b \cdot a + c \cdot a$.

D2.2. A ring is called a ring with identity if the ring has a multiplicative identity. If the
multiplication is commutative then the ring is called a commutative ring. A ring is called an
integral domain if it is a commutative ring with identity $e \neq 0$ in which $ab = 0$ implies $a = 0$ or
$b = 0$.

D2.3. A subset $I$ of a ring $R$ is called an ideal provided $I$ is a subring of $R$ and for all $a \in I$ and
$r \in R$ we have $ar \in I$ and $ra \in I$.

D2.4. Let $R$ be a commutative ring. An ideal $I$ of $R$ is said to be principal if there is an $a \in R$
such that $I = (a)$, where $(a) = \{ ra : r \in R \}$. In this case, $I$ is also called the principal ideal
generated by $a$.

D2.5. The ring of residue classes of the ring $R$ modulo the ideal $I$ under the operations

$(a + I) + (b + I) = (a + b) + I$,

$(a + I) \cdot (b + I) = (ab) + I$

is called the residue class ring of $R$ modulo $I$ and is denoted by $R/I$.

D2.6. For a prime $p$, let $F_p$ be the set $\{ 0, 1, \ldots, p-1 \}$ of integers and let $\varphi : Z/(p) \to F_p$ be the
mapping defined by $\varphi(a + pZ) = a$ for $a = 0, 1, \ldots, p-1$. Then $F_p$, endowed with the field
structure induced by $\varphi$, is a finite field called the Galois field of order $p$.

D2.7. For a finite field $F_p$ we denote by $F_p^*$ the multiplicative group of nonzero elements of
$F_p$.

D2.8. A generator of the cyclic group $F_p^*$ is called a primitive element of $F_p$.
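To make D2.7 and D2.8 concrete, the following sketch finds the primitive elements of $F_p$ by brute force. The helper names are hypothetical, and the exhaustive search is only meant for small primes such as the Fermat primes 17 and 257.

```python
def multiplicative_order(a, p):
    """Smallest k >= 1 with a**k = 1 in F_p (requires p prime and 1 <= a < p)."""
    k, x = 1, a % p
    while x != 1:
        x = x * a % p
        k += 1
    return k

def primitive_elements(p):
    """All generators of the cyclic group of nonzero elements of F_p."""
    return [a for a in range(1, p) if multiplicative_order(a, p) == p - 1]
```

For p = 17 this returns the eight elements 3, 5, 6, 7, 10, 11, 12, and 14; for p = 257 it returns 128 elements, among them 3.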

D2.9. A commutative ring $R$ with identity is called a field if and only if $R \setminus \{0\}$ is a group under
multiplication, where $R \setminus \{0\} = \{ a : a \in R, a \neq 0 \}$.

D2.10. If $R$ is an arbitrary ring and there exists a positive integer $n$ such that $nr = 0$ for every
$r \in R$, then the least such positive integer $n$ is called the characteristic of $R$. If no such integer $n$
exists, $R$ is said to have characteristic 0.




D2.11. The ring formed by the polynomials over $R$ is called the polynomial ring over $R$
and denoted by $R[x]$.

D2.12. Let

$$f(x) = \sum_{i=0}^{n} a_i x^i$$

be a polynomial over $R$ which is not the zero polynomial, so that we can suppose $a_n \neq 0$. Then
$a_n$ is called the leading coefficient of $f(x)$ and $a_0$ the constant term, while $n$ is called the
degree of $f(x)$, in symbols $n = \deg(f(x)) = \deg(f)$. If the leading coefficient of $f(x)$ is 1, then $f(x)$ is
called a monic polynomial.

D2.13. A polynomial $p \in F[x]$ is said to be irreducible over $F$ if $p$ has positive degree and $p = bc$
with $b, c \in F[x]$ implies that either $b$ or $c$ is a constant polynomial.

D2.14. An element $b \in F$ is called a root or zero of the polynomial $f \in F[x]$ if $f(b) = 0$.

D2.15. A field containing no proper subfield is called a prime field.

D2.16. Let $K$ be a subfield of $F$ and $M$ any subset of $F$. Then the field $K(M)$ is defined as the
intersection of all subfields of $F$ containing both $K$ and $M$ and is called the extension field of $K$
obtained by adjoining the elements in $M$. For finite $M = \{ \theta_1, \ldots, \theta_n \}$ we write $K(M) = K(\theta_1, \ldots, \theta_n)$.
If $M$ consists of a single element $\theta \in F$, then $K(\theta)$ is said to be a simple extension of $K$
and $\theta$ is called a defining element of $K(\theta)$ over $K$.

D2.17. Let $K$ be a subfield of $F$ and $\theta \in F$. If $\theta$ satisfies a nontrivial polynomial equation
over $K$, that is, if $a_n \theta^n + \cdots + a_1 \theta + a_0 = 0$ with $a_i \in K$ not all being 0, then $\theta$ is said to be
algebraic over $K$. An extension $K(\theta)$ of $K$ is called algebraic over $K$ if every element of $K(\theta)$
is algebraic over $K$.

D2.18. If $\theta \in F$ is algebraic over $K$, then the uniquely determined monic polynomial $g \in K[x]$
generating the ideal $J = \{ f \in K[x] : f(\theta) = 0 \}$ is called the minimal polynomial (or
defining polynomial) of $\theta$ over $K$. By the degree of $\theta$ over $K$ we mean the degree of $g$.




D2.19. If $K(\theta)$ is finite dimensional, then $K(\theta)$ is called a finite extension of $K$. The
dimension of the vector space $K(\theta)$ is called the degree of $K(\theta)$ over $K$, in symbols $[K(\theta) : K]$.

D2.20. Let $f \in K[x]$ be of positive degree and $F$ an extension field of $K$. Then $f$ is said to split
in $F$ if $f$ can be written as a product of linear factors in $F[x]$, that is, if there exist elements
$\alpha_1, \alpha_2, \ldots, \alpha_n \in F$ such that $f(x) = a(x - \alpha_1)(x - \alpha_2) \cdots (x - \alpha_n)$, where $a$ is the leading
coefficient of $f$. The field $F$ is a splitting field of $f$ over $K$ if $f$ splits in $F$.

D2.21. Let $F_{p^m}$ be an extension of $F_p$ and let $\alpha \in F_{p^m}$. Then the elements
$\alpha, \alpha^p, \alpha^{p^2}, \ldots, \alpha^{p^{m-1}}$ are called the conjugates of $\alpha$ with respect to $F_p$.

D2.22. Let $m$ be a positive integer. The splitting field of $x^m - 1$ over a field $K$ is called the $m$th
cyclotomic field over $K$ and denoted by $K^{(m)}$. The roots of $x^m - 1$ in $K^{(m)}$ are called the $m$th
roots of unity over $K$ and the set of all these roots is denoted by $E^{(m)}$.

D2.23. Let $K$ be a field of characteristic $p$, $m$ a positive integer not divisible by $p$, and $\alpha$ a
primitive $m$th root of unity over $K$. Then the polynomial

$$Q_m(x) = \prod_{\substack{s=1 \\ \gcd(s, m) = 1}}^{m} (x - \alpha^s)$$

is called the $m$th cyclotomic polynomial over $K$.
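For small $m$ the cyclotomic polynomials of D2.23 can be computed directly. The sketch below works over the rationals with integer coefficients and uses the standard identity $x^m - 1 = \prod_{d \mid m} Q_d(x)$, which is an assumption brought in for the example and is not stated in the text. Polynomials are stored as coefficient lists, lowest degree first, and the function names are hypothetical.

```python
def poly_divide_exact(num, den):
    """Divide integer polynomials (lowest degree first); den must be monic."""
    num = num[:]
    quot = [0] * (len(num) - len(den) + 1)
    for i in range(len(quot) - 1, -1, -1):
        quot[i] = num[i + len(den) - 1]       # current leading coefficient
        for j, c in enumerate(den):
            num[i + j] -= quot[i] * c         # subtract quot[i] * x**i * den
    assert not any(num), "division was not exact"
    return quot

def cyclotomic(m):
    """Q_m(x): x**m - 1 divided by Q_d(x) for every proper divisor d of m."""
    poly = [-1] + [0] * (m - 1) + [1]         # x**m - 1
    for d in range(1, m):
        if m % d == 0:
            poly = poly_divide_exact(poly, cyclotomic(d))
    return poly
```

For example, `cyclotomic(6)` gives the coefficients of $x^2 - x + 1$. Reducing the coefficients modulo a prime $p$ that does not divide $m$ gives the corresponding polynomial over a field of characteristic $p$.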

D2.24. (Roots of unity and cyclotomic polynomials) According to D2.22, a special case is
obtained if $K$ is the field of rational numbers. Then $K^{(m)}$ is a subfield of the field of complex
numbers. For our purposes, the most important case is that of a finite field $K$. The polynomial
$Q_m(x)$ is clearly independent of the choice of $\alpha$. The degree of $Q_m(x)$ is $\varphi(m)$ and its
coefficients obviously belong to the $m$th cyclotomic field over $K$.

A.3 THEOREMS

TH3.1. If $F$ is a finite field with $q$ elements and $K$ is a subfield of $F$, then the polynomial
$x^q - x$ in $K[x]$ factors in $F[x]$ as

$$x^q - x = \prod_{a \in F} (x - a)$$

and $F$ is a splitting field of $x^q - x$ over $K$.
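TH3.1 is easy to confirm numerically for a prime field. The sketch below expands the product of all linear factors $(x - a)$ over $F_p$ and compares it with $x^p - x$. It assumes $q = p$ is prime, which is the simplest case of the theorem; the helper names are hypothetical.

```python
def poly_mul_mod(f, g, p):
    """Product of two polynomials over F_p (coefficient lists, lowest degree first)."""
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] = (out[i + j] + a * b) % p
    return out

def product_of_linear_factors(p):
    """Expand the product of (x - a) over all a in F_p."""
    poly = [1]
    for a in range(p):
        poly = poly_mul_mod(poly, [(-a) % p, 1], p)
    return poly

p = 7
expected = [0, p - 1] + [0] * (p - 2) + [1]   # x**p - x, with -1 written as p - 1
matches = product_of_linear_factors(p) == expected
```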

TH3.2. (Existence and uniqueness of finite fields) For every prime $p$ and every positive
integer $n$ there exists a finite field with $p^n$ elements. Any finite field with $q = p^n$ elements is
isomorphic to the splitting field of $x^q - x$ over $F_p$. We denote this field by $F_{p^n}$ or $GF(p^n)$.

TH3.3. If $f$ is an irreducible polynomial in $F_p[x]$ of degree $n$, then $f$ has a root $\alpha$ in $F_{p^n}$.
Furthermore, all the roots of $f$ are simple and are given by the $n$ distinct elements
$\alpha, \alpha^p, \alpha^{p^2}, \ldots, \alpha^{p^{n-1}}$ of $F_{p^n}$.
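A small case may help in reading TH3.3 together with D2.21. The sketch below builds $F_{16}$ as $F_2[x]/(x^4 + x + 1)$, with each element stored as a 4-bit integer, and compares the conjugates of the class of $x$ with the roots of the defining polynomial. The choice of $p = 2$, $n = 4$, the polynomial $x^4 + x + 1$, and the function names are assumptions made for illustration.

```python
MOD = 0b10011            # f(x) = x^4 + x + 1, irreducible over F_2 (example choice)

def gf16_mul(a, b):
    """Multiply two elements of F_2[x]/(f), each stored as a 4-bit integer."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if a & 0b10000:
            a ^= MOD             # reduce modulo f
    return result

def f_at(x):
    """Evaluate f(x) = x^4 + x + 1 inside the field."""
    x2 = gf16_mul(x, x)
    x4 = gf16_mul(x2, x2)
    return x4 ^ x ^ 1

alpha = 0b0010           # the class of x, a root of f by construction
conjugates = [alpha]
for _ in range(3):
    conjugates.append(gf16_mul(conjugates[-1], conjugates[-1]))  # alpha^(2^i)

roots = [x for x in range(16) if f_at(x) == 0]
```

The four conjugates $\alpha, \alpha^2, \alpha^4, \alpha^8$ are distinct and coincide with the four roots of $f$, as the theorem states.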


APPENDIX  B 

CIRCUIT  DESIGNS:  HP-DCS  AND  MAGIC  CAD  TOOLS 

The documentation reported in this Appendix includes detailed design schematics, PostScript
printouts of the Magic layouts, and HP-DVI timing diagrams of the third-order polynomial
RNS processor. In the last part of this Appendix, examples of the Caltech Intermediate
Form (CIF) file and the PostScript file of a two-input NAND gate are demonstrated.

B.l  The  HP-DCS  Schematics  of  the  Nine-Bit  Third-Order  PRNS  Processor 

1) mux9 - The Nine-Bit Multiplexer;

2) lat4b - The Four-Tuple Latch;

3) mul8x8 - The Eight-Bit by Eight-Bit Multiplier;

4) scalerNRSQ - The (-r^2)-Scaler Module; and

5) scalerNR - The (-r)-Scaler Module.




Figure  B.1.1  mux9  - The  Nine-Bit  Multiplexer 




Figure  B.1.2  lat4b  - The  Four-Tuple  Latch 




Figure  B.1.3  mul8x8  - The  Eight-Bit  by  Eight-Bit  Multiplier 


Figure B.1.4 scalerNRSQ - The (-r^2)-Scaler Module


Figure B.1.5 scalerNR - The (-r)-Scaler Module




B.2 The HP-DVI Timing Diagrams of the Nine-Bit Third-Order PRNS Processor

1) The Timing Diagram of negmodp - The Negator;

2) The Timing Diagram of summodp - The Sum Modulo Converter;

3) The Timing Diagram of sumovck - The Sum Overflow Checker;

4) The Timing Diagram of mux9 - The Nine-Bit Multiplexer;

5) The Timing Diagram of add9modp - The Nine-Bit Modulo Adder;

6) The Timing Diagram of mul8x8 - The Eight-Bit by Eight-Bit Multiplier;

7) The Timing Diagram of mul9modp - The Nine-Bit Modulo Multiplier;

8) The Timing Diagram of forwmap - The Isomorphic Forward Mapping Module;

9) The Timing Diagram of backmap - The Isomorphic Inverse Mapping Module; and

10) The Timing Diagram of scalerNRSQ - The (-r^2)-Scaler Module.




Figure  B.2.1  The  Timing  Diagram  of  negmodp  - The  Negator 




Figure  B.2.2  The  Timing  Diagram  of  summodp  - The  Sum  Modulo  Converter 




Figure  B.2.3  The  Timing  Diagram  of  sumovck  - The  Sum  Overflow  Checker 


Figure B.2.4 The Timing Diagram of mux9 - The Nine-Bit Multiplexer




Figure  B.2.5  The  Timing  Diagram  of  add9modp  - The  Nine-bit  Modulo  Adder 


Figure B.2.6 The Timing Diagram of mul8x8 - The Eight-Bit by Eight-Bit Multiplier


Figure B.2.7 The Timing Diagram of mul9modp - The Nine-Bit Modulo Multiplier


TIME BASE - 1ns  MAX SIM TIME - 1us
EVENT GRID - 1ns  PULSE WIDTH - 10ns  Last end time - 948200ps


Figure  B.2.8  The  Timing  Diagram  of  forwmap  - 
The  Isomorphic  Forward  Mapping  Module 



Figure  B.2.9  The  Timing  Diagram  of  backmap  - 
The  Isomorphic  Inverse  Mapping  Module 


Figure B.2.10 The Timing Diagram of scalerNRSQ - The (-r^2)-Scaler Module


B.3  The  Magic  Design  of  the  Five-Bit  Third-Order  PRNS  Processor 

1)  The  Magic  Design  Circuit  diagram  I; 

2)  The  Magic  Design  Circuit  diagram  II; 

3)  The  Magic  Design  Circuit  diagram  III; 

4)  The  Magic  Design  Circuit  diagram  IV; 

5)  The  Magic  Design  Circuit  diagram  V; 

6)  The  Magic  Transistor  Layout  of  multip  - The  Three  by  Three  Multiplier; 

7) The Magic Transistor Layout of sumall - The Modular Adder;

8) The Magic Transistor Layout of smod - The Modular Adder for the isomorphic mappings;

9) The Magic Transistor Layout of scln - The FFT-Type module for the isomorphic mappings;

10)  The  Magic  Transistor  Layout  of  fmap  - The  Isomorphic  forward  mappings; 

11)  The  Magic  Transistor  Layout  of  padin  - The  Input  Pad;  and 

12)  The  Magic  Transistor  Layout  of  padout  - The  Output  Pad. 




FULL  ADDER  FOR  ADDMODP  (FILE  NAME:  zfad.mag) 


MODULO  P ADDER  (FILE  NAME:  sumall.mag) 


Figure B.3.1 The Magic Design Circuit diagram I


FULL ADDER FOR MULTIPLIER (FILE NAME: mfadder.mag)


3 BY 3 MULTIPLIER (FILE NAME: multip.mag)


Figure B.3.2 The Magic Design Circuit diagram II


MODULES OF THE 5-BIT POLYNOMIAL RNS PROCESSOR
Note: (# of transistors, (cell length x, cell height y) in lambda)

mulmodp (5-bit modulo P multiplier) (1138, (1010, 1400))
scalerX (Scaler module X) (3)2, (835, 750))
fmap (Isomorphic forward mapping) (4624, (5650, H50))
bmap (Isomorphic backward mapping) (3828, (5650, 1X50))
sumall (modulo p adder) (266, (620, 484))
tritr3 (3-bit tri-state trans/receiver) (48, (105, 230))


Figure B.3.3 The Magic Design Circuit diagram III




Figure  B.3.4  The  Magic  Design  Circuit  diagram  IV 


SYSTEM  BLOCK  OF  THE  5-BIT  MODULO  P MULTIPLIER 
FILE  NAME:  mulmodp.mag 




Figure  B.3.5  The  Magic  Design  Circuit  diagram  V 




Figure  B.3.6  The  Magic  Transistor  Layout  of  multip  - The  Three  by  Three  Multiplier 




Figure  B.3.7  The  Magic  Transistor  Layout  of  sumall  - The  Modular  Adder 




Figure  B.3.8  The  Magic  Transistor  Layout  of  smod  - 
The  Modular  Adder  for  the  isomorphic  mappings 


Figure B.3.9 The Magic Transistor Layout of scln -
The FFT-Type module for the isomorphic mappings


Figure  B.3.10  The  Magic  Transistor  Layout  of  fmap  - 
The  Isomorphic  forward  mappings 


Figure B.3.11 The Magic Transistor Layout of padin - The Input Pad


Figure B.3.12 The Magic Transistor Layout of padout - The Output Pad


B.4 CIF and PostScript Files of a Two-Input NAND Gate
B.4.1 CIF: A Geometry Language

CIF, the Caltech Intermediate Form, is a language whose primitives are colored shapes.
It is called “intermediate” because it is both a language in which designs of circuits can be
couched and a language from which the masks (like negatives) used to fabricate a chip
can be made automatically [Mea80a]. A brief discussion of the language and an example
of a two-input NAND gate are presented.

(1) The Box Statement

The most basic statement is one defining a rectangle, called a box. We place a box at a
particular place on a grid by a statement of the form
B <xdist> <ydist> <xcent> <ycent>

where <xdist> and <ydist> are the lengths of the sides along the x and y axes, and <xcent>
<ycent> are the x and y coordinates of the center of the rectangle. We can also specify a
rotation of the rectangle with two additional components. The clause
D a b

calls for the x axis to be rotated until it has slope a/b. Thus, a box of length 6, width 2, center at
(1,3), and rotated forty-five degrees clockwise is specified as
B 6 2 1 3 D -1 1.

(2) The Layer Statement

A box must be assigned a “color” or “layer” in addition to the box statement. The statement

L <layer designation>

specifies a layer. The <layer designation> is a code for the name of one of the layers used in
the design. The most commonly used ones in the CMOS technology are CWN (nwell), CWP
(pwell), CMS (allMetal2), CAA (allDiff), CCA (ndc, pdc) and CCP (pc).

(3)  Defined  Cells 




CIF has a mechanism, much like a procedure call, that enables us to define a cell, or symbol,
to be some collection of shapes, and then to call that cell as many times as we wish.
The definition of a cell is introduced by a statement of the form
DS <symbol number> <scale>

where <symbol number> is an integer that serves as the name of the cell, and <scale> is
a pair of integers a and b, such that all dimensions and coordinates of boxes are multiplied by
a/b. The DS stands for “Definition Start,” whereas the end of a definition is marked by the
statement DF, or “Definition Finish.” The fundamental unit of dimensions is not λ but 0.01
micron. Thus, if λ = 1.5 microns then <scale> = 150 1.
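The effect of the scale pair on a box can be sketched in a few lines. The hypothetical helper below applies the a/b factor of a DS statement to one unrotated box statement and converts the center-based description into a size and a lower-left corner, which is the form used by the B operator in the PostScript listing of Figure B.4.2. Rotated boxes are not handled; the function name and interface are assumptions made for illustration.

```python
from fractions import Fraction

def scaled_box(statement, scale_a, scale_b):
    """Convert 'B <xdist> <ydist> <xcent> <ycent>;' to (dx, dy, x_ll, y_ll).

    Every value is multiplied by scale_a / scale_b, as a DS statement prescribes.
    (x_ll, y_ll) is the lower-left corner of the scaled box.
    """
    fields = statement.strip().rstrip(";").split()
    assert fields[0] == "B" and len(fields) == 5, "only unrotated boxes are handled"
    s = Fraction(scale_a, scale_b)
    xdist, ydist, xcent, ycent = (int(v) * s for v in fields[1:])
    return (xdist, ydist, xcent - xdist / 2, ycent - ydist / 2)

# First CWN box of the NAND gate, defined under "DS 1 50 2" (scale 50/2 = 25).
first_cwn = scaled_box("B 104 8 92 40;", 50, 2)
```

With the scale 50/2 of Figure B.4.1, the box `B 104 8 92 40` becomes a 2600 by 200 rectangle whose lower-left corner is (1000, 900), in units of 0.01 micron. These are the four numbers that precede the B operator on the first CWN line of Figure B.4.2.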

The call of a symbol is effected by the statement
C <symbol number> <list of transformations>

where <list of transformations> is a list of elements, each with one of the forms (a) T <xorigin>
<yorigin>, (b) R a b, (c) MX and/or MY. These forms stand for Translate, Rotate, and
Mirror, respectively. A CIF example of a two-input NAND gate is shown in Figure B.4.1.




DS 1 50 2;
9 nand2;
L CWN;
B 104 8 92 40;
B 120 68 92 2;
L CMF;
B 80 16 92 24;
B 16 28 60 2;
B 16 16 92 -4;
B 16 28 124 2;
B 12 12 94 -18;
B 44 12 110 -30;
B 12 16 126 -44;
B 32 28 68 -66;
B 16 16 124 -60;
B 80 16 92 -88;
L CPG;
B 8 72 76 -4;
B 24 8 84 -44;
B 8 48 92 -72;
B 8 128 108 -32;
L CAA;
B 16 28 60 18;
B 16 28 124 18;
B 80 16 92 -4;
B 80 16 92 -60;
L CCA;
B 8 8 60 -4;
B 8 8 92 -4;
B 8 8 124 -4;
B 8 8 76 -60;
B 8 8 124 -60;
L CCA;
B 8 8 60 24;
B 8 8 124 24;
B 8 8 60 -60;
L CSN;
B 32 24 60 28;
B 32 24 124 28;
B 72 32 104 -60;
L CSP;
B 32 4 60 14;
B 32 4 124 14;
B 96 32 92 -4;
B 24 32 56 -60;
94 O 124 -32 CMF;
94 A 76 -44 CPG;
94 B 108 -44 CPG;
94 GND! 76 -76 CMF;
94 Vdd! 92 24 CMF;
DF;
C 1;
End


Figure  B.4.1  The  CIF  Description  of  A Two-Input  NAND  Gate 




B.4.2 PostScript: A Geometry Language


%!PS-Adobe-1.0
%%Creator: cif2ps
%%DocumentFonts: Helvetica
%%Pages: (atend)
%%EndComments
% PostScript from "cif2ps," a CIF to PostScript translator by:
%   Arthur Simoneau, The Aerospace Corporation, El Segundo, Calif
% with additions by:
%   Marc Lesure, Arizona State University, Tempe, AZ (May 1988)
% as well as:
%   Gordon W. Ross, The MITRE Corporation, Bedford, MA (June 1989)
%   (proud author of this header code :-)
% Header code follows:

/B {    % dx dy x1 y1 ==> ---
        % Build a box: size=(dx,dy); lower_left=(x1,y1)
        newpath moveto dup 0 exch       % dx dy 0 dy
        rlineto exch 0                  % dy dx 0
        rlineto 0 exch neg              % 0 -dy
        rlineto closepath
} bind def      % B

/L {    % step angle ==> ---
        % Fill the current path with lines spaced by STEP and
        % rotated from the X-axis by ANGLE degrees ccw.
        gsave clip 0 setgray 0 setlinewidth
        matrix setmatrix
        rotate dup scale
        pathbbox newpath                % x1 y1 x2 y2
        1 add cvi 4 1 roll              % y2' x1 y1 x2
        1 add cvi 4 1 roll              % x2' y2' x1 y1
        1 sub cvi exch 1 sub cvi        % x2' y2' y1' x1'
        2 copy exch translate
        3 1 roll sub                    % x2' x1' dy
        3 1 roll sub exch               % dx dy
        {       % repeat (dy) times     % dx
                0 0 moveto dup 0        % dx dx 0
                rlineto stroke          % dx
                0 1 translate           % dx
        } repeat pop
        grestore
} bind def      % L

/WW 1 def       % default Wire Width (space from line)

/Wto {  % x1 y1 x2 y2 ==> x2 y2
        % Draws a path spaced WW from the line (x1,y1)-(x2,y2)
        newpath
        % wire angle a=atan(dy/dx)
        4 2 roll 4 copy exch            % x2 y2 x1 y1 x2 y2 y1 x1
        3 1 roll sub 3 1 roll sub       % x2 y2 x1 y1 dy dx
        atan dup 4 1 roll               % x2 y2 a x1 y1 a
        WW exch 90 add dup 180 add arc  % x2 y2 a
        3 copy pop 3 2 roll             % x2 y2 x2 y2 a
        WW exch 90 sub dup 180 add arc  % x2 y2
        closepath                       % x2 y2 (leave for next Wto)
} bind def      % Wto

/X {    % Draw an X on the current figure
        gsave clip
        pathbbox newpath                % x1 y1 x2 y2
        4 copy moveto lineto            % x1 y1 x2 y2
        3 1 roll exch                   % x1 y2 x2 y1
        moveto lineto stroke            %
        grestore
} bind def      % X
% End of header code
%%EndProlog
%%Page: 0:0
36 dup translate        % margins
/Helvetica findfont 6 0.18 div scalefont setfont
0.18 dup scale          % points/centi-micron
-800 2400 translate     % cell_origin
.5 setgray
% CAA
400 700 1300 100 B fill
400 700 2900 100 B fill
2000 400 1300 -300 B fill
2000 400 1300 -1700 B fill
.3 setgray
% CPG
200 1800 1800 -1000 B fill
600 200 1800 -1200 B fill
200 1200 2200 -2400 B fill
200 3200 2600 -2400 B fill
0 setgray
% CCA
200 200 1400 -200 B fill
200 200 2200 -200 B fill
200 200 3000 -200 B fill
200 200 1800 -1600 B fill
200 200 3000 -1600 B fill
200 200 1400 500 B fill
200 200 3000 500 B fill
200 200 1400 -1600 B fill
0 setgray
% CWN
2600 200 1000 900 B 32 135 L
3000 1700 800 -800 B 32 135 L
% CMF
2000 400 1300 400 B 8 45 L
400 700 1300 -300 B 8 45 L
400 400 2100 -300 B 8 45 L
400 700 2900 -300 B 8 45 L
300 300 2200 -600 B 8 45 L
1100 300 2200 -900 B 8 45 L
300 400 3000 -1300 B 8 45 L
800 700 1300 -2000 B 8 45 L
400 400 2900 -1700 B 8 45 L
2000 400 1300 -2400 B 8 45 L
% CSN
800 600 1100 400 B 16 135 L
800 600 2700 400 B 16 135 L
1800 800 1700 -1900 B 16 135 L
% CSP
800 100 1100 300 B 16 45 L
800 100 2700 300 B 16 45 L
2400 800 1100 -500 B 16 45 L
600 800 1100 -1900 B 16 45 L
0 setgray
3100 -800 moveto (0) show
1900 -1100 moveto (A) show
2700 -1100 moveto (B) show
1900 -1900 moveto (GND!) show
2300 600 moveto (Vdd!) show
showpage
%%Trailer
%%Pages: 1


Figure B.4.2 The PostScript Description of a Two-Input NAND Gate


BIBLIOGRAPHY 


[Adl79] Adleman, L., “A Subexponential Algorithm for the Discrete Logarithm Problem
with Applications to Cryptography,” in Proceedings of the IEEE 20th Annual Symposium
on Foundations of Computer Science, pp. 55-60, 1979.

[Aga74a] Agarwal, R. C. and Burrus, C. S., “Fast One-Dimensional Digital Convolution
by Multidimensional Techniques,” IEEE Transactions on Acoustics, Speech,
and Signal Processing, Vol. ASSP-22, No. 1, pp. 1-10, February 1974.

[Aga74b] Agarwal, R. C. and Burrus, C. S., “Fast Convolution Using Fermat Number
Transform With Applications to Digital Filtering,” IEEE Transactions on
Acoustics, Speech, and Signal Processing, Vol. ASSP-22, No. 4, pp. 87-97,
April 1974.

[Aga75] Agarwal, R. C. and Burrus, C. S., “Number Theoretic Transforms to Implement
Fast Digital Convolution,” Proceedings of the IEEE, Vol. 63, No. 4, pp. 550-560,
April 1975.

[Aho74] Aho, A. V., Hopcroft, J. E., Ullman, J. D., The Design and Analysis of Algorithms,
Reading, MA: Addison-Wesley Publishing Company, 1974.

[Ber68] Berlekamp, E. R., Algebraic Coding Theory, New York, NY: McGraw-Hill,
1968.

[Bla83] Blahut, R. E., Theory and Practice of Error Control Codes, Reading, MA:
Addison-Wesley Publishing Company, 1983.

[Bla85] Blahut, R. E., Fast Algorithms for Digital Signal Processing, Reading, MA:
Addison-Wesley Publishing Company, 1985.

[Blak84] Blake, I. F., Fuji-Hara, R., Mullin, R. C., and Vanstone, S. A., “Computing
Logarithms in Finite Fields of Characteristic Two,” SIAM Journal of Algebraic and
Discrete Methods, Vol. 5, No. 2, pp. 276-285, June 1984.

[Cmo87] CMOS Cell Library Development Project, Tampa, FL: University of South
Florida, May 1987.




[Con68]  Conway, J. H., “A Tabulation of Some Information Concerning Finite Fields,” in
Computers in Mathematical Research, R. F. Churchhouse and J. C. Herz, Eds.,
Amsterdam: North-Holland, pp. 37-50, 1968.

[Coz85]  Cozzens, J. H. and Finkelstein, L. A., “Computing the Discrete Fourier
Transform Using Residue Number Systems in a Ring of Algebraic Integers,” IEEE
Transactions on Information Theory, Vol. IT-31, No. 5, pp. 580-588, September
1985.

[DEC90]  1990 DECWRL/Livermore Magic Release, Palo Alto, CA: Western Laboratory,
September, 1990.

[Ell82]  Elliott, D. F., Rao, K. R., Fast Transforms: Algorithms, Analyses, Applications,
New York, NY: Academic Press, Inc., 1982.

[Gam85]  Games, R. A., “Complex Approximations Using Algebraic Integers,” IEEE
Transactions on Information Theory, Vol. IT-31, No. 5, pp. 565-579, September
1985.

[Goo60]  Good, I. J., “The Iteration Algorithm and Practical Fourier Series,” J. Royal
Statis. Soci., Ser. B20, pp. 361-372, 1958, Addendum 22, pp. 372-375, 1960.

[Har65]  Hardy, G. H., Wright, E. M., An Introduction to the Theory of Numbers, Oxford:
Clarendon Press, 1965. (512.81 H269i4)

[Hpd88]  HP DCS Manuals, Colorado Springs, CO: Hewlett-Packard Company, 1988.

[Hub90]  Huber, K., “Some Comments on Zech’s Logarithms,” IEEE Transactions on
Information Theory, Vol. IT-36, No. 4, pp. 946-950, July 1990.

[Ima80]  Imamura, K., “A Method for Computing Addition Tables in GF(p^n),” IEEE
Transactions on Information Theory, Vol. IT-26, No. 3, pp. 367-369, May 1980.

[Jen80]  Jenkins, W. K., “Complex Residue Number Arithmetic for High-Speed Signal
Processing,” Electronics Letters, 14th, Vol. 16, No. 17, pp. 660-661, August
1980.

[Jen87]  Jenkins, W. K., Krogmeier, J. V., “The Design of Dual-Mode Complex Signal
Processors Based on Quadratic Modular Number Codes,” IEEE Transactions on
Circuits and Systems, Vol. CAS-34, pp. 354-364, April 1987.

[Kao85]  Kao, R. S., “A Single Modulus Complex ALU for Digital Signal Processing,”
Master’s Thesis, University of Florida, Gainesville, December 1985.

[Kao87]  Kao, R. S., and Taylor, F. J., “Implementation of the Single Modulus Complex
ALU,” Proceeding of the 8th Symposium on Computer Arithmetic, pp. 21-27,
May 1987.




[Kao90]  Kao, R. S., and Taylor, F. J., “A Fast Galois Field Transform Algorithm Using
Normal Bases,” accepted for publication in IEEE Proceeding of the 24th Asilomar
Conference on Signals, Systems & Computers, November 1990.

[Kao91a]  Kao, R. S., “A Multiplier-Free Fast Transform with Efficient VLSI Implementation
for Polynomial RNS Processors,” accepted for publication in IEEE Proceeding
of ICASSP, May 1991.

[Kao91b]  Kao, R. S., and Taylor, F. J., “The Basis-Change Algorithm for Fast Finite Field
Transforms,” accepted for publication in IEEE Southeastcon 91, April, 1991.

[Knu69]  Knuth, D. E., The Art of Computer Programming, Vol. II: Seminumerical
Algorithms, Reading, MA: Addison-Wesley Publishing Company, 1969.

[Kol77]  Kolba, D. P., and Parks, T. W., “A Prime Factor FFT Algorithm Using
High-Speed Convolution,” IEEE Transactions on Acoustics, Speech, and Signal
Processing, Vol. ASSP-25, No. 4, pp. 281-294, August 1977.

[Kro83]  Krogmeier, J. V., Jenkins, W. K., “Error Detection and Correction in Quadratic
Residue Number Systems,” Proceedings of 26th Midwest Symposium on Circuits
and Systems, pp. 408-411, August 1983.

[Kron79]  Kronsjo, L. I., Algorithms: Their Complexity and Efficiency, New York: John
Wiley  & Sons,  1979. 

[Lid86]  Lidl,  R.  and  Niederreiter,  H.,  Introduction  to  Finite  Fields  and  Their  Applica- 
tions, Cambridge,  MA:  Cambridge  University  Press,  1986. 

[Lin83]  Lin, S., Costello, D. J. Jr., Error Control Coding: Fundamentals and Applications,
Englewood Cliffs, NJ: Prentice-Hall, 1983.

[Lip81]  Lipson, J. D., Elements of Algebra and Algebraic Computing, Reading, MA:
Addison-Wesley Publishing Company, 1981.

[Leu81]  Leung, S., “Application of Residue Number Systems to Complex Digital Filters,”
IEEE Proceeding of the 15th Asilomar Conference on Signals, Systems &
Computers, November 1981.

[Mac77]  MacWilliams, F. J. and Sloane, N. J. A., The Theory of Error-Correcting Codes,
Amsterdam: North-Holland, 1977.

[McC79]  McClellan, J. H. and Rader, C. M., Number Theory in Digital Signal Processing,
Englewood  Cliffs,  NJ:  Prentice-Hall,  1979. 

[McE87]  McEliece,  R.  J.,  Finite  Fields  for  Computer  Scientists  and  Engineers,  Norwell, 
MA:  Kluwer  Academic  Publishers,  1987. 

[Mea80]  Mead, C. and Conway, L., Introduction to VLSI Systems, Reading, MA:
Addison-Wesley Publishing Company, 1980.




[Mul89]  Mullin, R. C., Onyszchuk, I. M., Vanstone, S. A., and Wilson, R. M., “Optimal
Normal Basis in GF(p^n),” Discrete Applied Mathematics, Vol. 22, pp. 149-161,
1989.

[Nag64]  Nagell, T., Introduction to Number Theory, New York: Chelsea, 1964.

[Niv80]  Niven, I., Zuckerman, H. S., An Introduction to the Theory of Numbers, New
York:  John  Wiley  & Sons,  1980. 

[Nus82]  Nussbaumer, H. J., Fast Fourier Transform and Convolution Algorithms, 2nd ed.,
Berlin: Springer-Verlag, 1982.

[Opp75]  Oppenheim,  A.  V.,  and  Schafer,  R.  W.,  Digital  Signal  Processing,  Englewood 
Cliffs,  NJ:  Prentice-Hall,  1975. 

[Pet72]  Peterson,  W.  W.,  Weldon,  E.  J.  Jr.,  Error  Correcting  Codes,  Cambridge,  MA: 
MIT  Press,  1972. 

[Pol50]  Pollard, H., Theory of Algebraic Numbers, Carus Math. Monographs, No. 9,
Math.  Ass.  Amer.,  1950. 

[Rad68]  Rader,  C.  M.,  “Discrete  Fourier  Transforms  When  the  Number  of  Data  Samples 
Is  Prime,”  Proceeding  IEEE,  Vol.  56,  pp.  1107-1108,  June  1968. 


[Rad72]  Rader,  C.  M.,  “Discrete  Convolution  via  Mersenne  Transforms,”  IEEE  Transac- 
tions on  Computers,  Vol.  C-21,  No.  12,  pp.  1269-1273,  December  1972. 

[Rao81]  Rao, T. R. N., “Arithmetic of Finite Fields,” Proceeding of the 5th Symposium on
Computer  Arithmetic,  pp.  2-5,  May  1981. 

[Red86]  Redinbo,  G.  R.  and  Rao,  K.  K.,  “Expediting  Factor-Type  Fast  Finite  Field  Trans- 
form Algorithms,”  IEEE  Transactions  on  Information  Theory,  Vol.  IT-32,  No.  2, 
pp.  168-194,  March  1986. 

[Ree75a]  Reed, I. S. and Truong, T. K., “The Use of Finite Fields to Compute
Convolutions,” IEEE Transactions on Information Theory, Vol. IT-21, No. 2,
pp. 208-213, March 1975.

[Ree75b]  Reed, I. S. and Truong, T. K., “Complex Integer Convolutions Over a Direct Sum
of  Galois  Fields,”  IEEE  Transactions  on  Information  Theory,  Vol.  IT-21,  No.  6, 
pp.  208-213,  November  1975. 

[Sch86]  Schroeder,  M.  R.,  Number  Theory  in  Science  and  Communication,  Heidelberg: 
Springer-Verlag, 1986.

[Ska87]  Skavantzos, A., “The Polynomial Residue Number System and Its
Applications,” Ph.D. dissertation, University of Florida, Gainesville, 1987.


[Sod86]  Soderstrand, M. A., Jenkins, W. K., Jullien, G. A., and Taylor, F. J., eds., Residue
Number System Arithmetic: Modern Applications in Digital Signal Processing,
New York, NY: IEEE Press, 1986.

[Tay81]  Taylor, F. J., “Large Moduli Multipliers for Signal Processing,” IEEE
Transactions on Circuits and Systems, Vol. CAS-28, No. 7, pp. 731-736, July 1981.

[Tay82]  Taylor, F. J., “A VLSI Residue Arithmetic Multiplier,” IEEE Transactions on
Computers, Vol. C-31, No. 6, pp. 540-546, June 1982.

[Tay83]  Taylor, F. J., “An Overflow-Free Residue Multiplier,” IEEE Transactions on
Computers, Vol. C-32, No. 5, pp. 501-504, May 1983.

[Tay84]  Taylor, F. J., “Residue Arithmetic: A Tutorial with Examples,” IEEE Journal
Computers, Vol. 17, No. 5, pp. 50-63, May 1984.

[Tay85]  Taylor, F. J., Papadourakis, G., Skavantzos, A., and Stouraitis, A., “A Radix-4
FFT Using Complex RNS Arithmetic,” IEEE Transactions on Computers, Vol.
C-34, No. 6, pp. 573-576, June 1985.

[Van78]  Vanwormhoudt, M. C., “Structural Properties of Complex Residue Rings Applied
to Number Theoretic Fourier Transforms,” IEEE Transactions on Acoustics,
Speech, and Signal Processing, Vol. ASSP-26, No. 1, pp. 99-104, February
1978.

[Veg76]  Vegh, E. and Leibowitz, L. M., “Fast Complex Convolution in Finite Rings,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-,
No. 4, pp. 343-344, August 1976.

[Wan85]  Wang, C. C., Truong, T. K., Shao, H. M., Deutsch, L. J., Omura, J. K., and Reed, I.
S., “VLSI Architectures for Computing Multiplications and Inverses in
GF(2^m),” IEEE Transactions on Computers, Vol. 34, No. 8, pp. 709-717,
August 1985.

[Wes85]  Weste, N., and Eshraghian, K., Principles of CMOS VLSI Design: A System
Perspective, Reading, MA: Addison-Wesley Publishing Company, 1985.


BIOGRAPHICAL  SKETCH 


Rom-Shen Kao was born in Penhu, Taiwan, on October 19, 1955. He received a BSEE
from the National Chiao-Tung University, Hsin-Chu, Taiwan, in 1979. From October 1979 to
August 1981, he served in the Chinese Air Force as an electronics officer. In 1981, he joined
the Electronics Research and Service Organization, Hsin-Chu, Taiwan, as a microcomputer
hardware design engineer. He entered the graduate school at the University of Florida in
August 1983 and received an MSEE in 1985. From 1985 to 1987, he worked as a design engineer
at Vital Industrial, Inc., Gainesville, Florida, where he was involved in the development
of video post-production equipment. In 1987, he joined the Digital Services Corporation,
Gainesville, Florida, as a video signal processing engineer. He is currently completing his
Ph.D. in electrical engineering at the University of Florida. He is scheduled to receive his
Doctor of Philosophy degree in May of 1991.




I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  stan- 
dards of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a dissertation 
for  the  degree  of  Doctor  of  Philosophy. 


Fred J. Taylor, Chairman
Professor  of  Electrical  Engineering 


I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  stan- 
dards of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a dissertation 
for  the  degree  of  Doctor  of  Philosophy. 


Y. C. Chow

Professor  of  Computer  and  Information 
Sciences 


I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  stan- 
dards of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a dissertation 
for  the  degree  of  Doctor  of  Philosophy. 


Associate  Professor 
of  Electrical  Engineering 


I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  stan- 
dards of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a dissertation 
for  the  degree  of  Doctor  of  Philosophy. 


Mark  Law 
Assistant  Professor 
of  Electrical  Engineering 


I certify  that  I have  read  this  study  and  that  in  my  opinion  it  conforms  to  acceptable  stan- 
dards of  scholarly  presentation  and  is  fully  adequate,  in  scope  and  quality,  as  a dissertation 
for  the  degree  of  Doctor  of  Philosophy. 


of  Electrical  Engineering 


This  dissertation  was  submitted  to  the  Graduate  Faculty  of  the  College  of  Engineering 
and  to  the  Graduate  School  and  was  accepted  as  partial  fulfillment  of  the  requirements  for  the 
degree  of  Doctor  of  Philosophy. 


May  1991 


Winfred  M.  Phillips 
Dean,  College  of  Engineering 


Madelyn  M.  Lockhart 
Dean,  Graduate  School 


